
Add SYCL Kernels for XPU backend #1679

Open: wants to merge 52 commits into base: main

Conversation

@xiaolil1 commented Jun 15, 2025

This is the pull request for the SYCL kernels targeting the XPU backend.

  • It implements the "dequantize_blockwise", "dequantize_4bit", and fused "dequant + gemv_4bit" kernels (a usage sketch follows below).
  • The target low-precision quantization data types are NF4, FP4, and General8bits.
  • This PR aims to eliminate the dependency on IPEX and improve performance.
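
For context, a minimal sketch (not part of this PR) of how these ops are exercised from Python on an XPU device through bitsandbytes.functional; shapes are illustrative and exact signatures may differ across bitsandbytes versions:

import torch
import bitsandbytes.functional as F

device = "xpu"  # assumes an XPU device is visible to PyTorch

# 4-bit NF4 quantization of a weight matrix, then dequantization and the fused dequant + gemv path.
W = torch.randn(4096, 4096, dtype=torch.float16, device=device)
qW, state = F.quantize_4bit(W, blocksize=64, quant_type="nf4")
W_dq = F.dequantize_4bit(qW, state)          # dequantize_4bit kernel

x = torch.randn(1, 1, 4096, dtype=torch.float16, device=device)
y = F.gemv_4bit(x, qW.t(), state=state)      # fused dequant + gemv_4bit kernel

# Blockwise 8-bit round trip (dequantize_blockwise kernel).
A = torch.randn(4096, device=device)
qA, qstate = F.quantize_blockwise(A)
A_dq = F.dequantize_blockwise(qA, qstate)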

xiaolil1 and others added 19 commits June 15, 2025 16:08
remove check for better performance

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@matthewdouglas added the Low Priority (will be worked on after all priority issues) and Intel labels Jun 17, 2025
@matthewdouglas self-assigned this Jun 17, 2025
@matthewdouglas self-requested a review June 17, 2025 16:19
@matthewdouglas added this to the v0.48.0 milestone Jun 17, 2025
@fengyuan14

Can we use a more accurate title for the commit? Otherwise reviewers may be confused into thinking all SYCL kernels are included in this PR.

@xiaolil1
Author

Can we use a more accurate title for the commit? Otherwise reviewers may be confused into thinking all SYCL kernels are included in this PR.

This is the first PR for SYCL kernels targeting QLoRA; I have added a detailed description.

jiqing-feng and others added 9 commits July 1, 2025 13:59
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@Egor-Krivov
Contributor

@xiaolil1

When I tried to compile it, I ran into errors with sycl::and_range and sycl::and_item. Are you sure these shouldn't be sycl::nd_range and sycl::nd_item?

https://github.khronos.org/SYCL_Reference/iface/nd_range.html

https://github.khronos.org/SYCL_Reference/iface/nd_item.html

@Egor-Krivov
Contributor

I replaced the types as described above and tested the implementation.

In my experiment the SYCL implementation was about 2x faster than Triton for token generation, presumably thanks to the fused dequant + matmul. The Triton compiler currently has an issue with that: intel/intel-xpu-backend-for-triton#4327.

However, some tests failed with BNB_TEST_DEVICE="xpu" pytest -q --tb=short --ignore test_optim.py --ignore test_triton.py --ignore test_cuda_setup_evaluator.py:

============================================ FAILURES =============================================
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-uint8-fp16-fc2-nf4-DQ_True-xpu] ________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-bf16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp32-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_False-xpu] ___
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)

* fix logs
* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@xiaolil1
Author

xiaolil1 commented Jul 7, 2025

@xiaolil1

When I tried to compile it, I ran into errors with sycl::and_range and sycl::and_item. Are you sure these shouldn't be sycl::nd_range and sycl::nd_item?

https://github.khronos.org/SYCL_Reference/iface/nd_range.html

https://github.khronos.org/SYCL_Reference/iface/nd_item.html

Hi @Egor-Krivov,
Yes, you are right: it should be "sycl::nd_range/nd_item", not "and_range/and_item".
The typo was caused by a pre-commit issue that changed the code unexpectedly.
Thanks for the reminder; we will fix it.

@jiqing-feng
Contributor

jiqing-feng commented Jul 7, 2025

Hi @Egor-Krivov . Could you share your device name? I can pass all tests on Intel(R) Data Center GPU Max 1550.
= 2362 passed, 1540 skipped, 184 deselected, 24 xfailed, 31 warnings in 1081.07s (0:18:01) =

* fix sycl nd
* fix tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@Egor-Krivov
Contributor

Hi @Egor-Krivov . Could you share your device name? I can pass all tests on Intel(R) Data Center GPU Max 1550. = 2362 passed, 1540 skipped, 184 deselected, 24 xfailed, 31 warnings in 1081.07s (0:18:01) =

(triton) (base) jovyan@jupyter-ekrivov:~/triton/unsloth$ sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Data Center GPU Max 1100 12.60.7 [1.3.27642]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6438Y+ OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO  [23.43.27642.69]



# SYCL should be faster for xpu, so at first checking if it is available.
if not isinstance(lib, ErrorHandlerMockBNBNativeLibrary):
@Egor-Krivov (Contributor) commented Jul 7, 2025

Currently you either pick all methods from SYCL or all methods from Triton. However, the SYCL implementation is currently missing these methods, which are available in Triton:

quantize_blockwise
quantize_4bit

I suggest we keep using these Triton methods even with SYCL, since that's the only option on XPU for now (see the sketch below).
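
As an illustration only, a minimal Python sketch of that per-op fallback; the op tables, placeholder callables, and the make_dispatch helper are hypothetical and not the actual bitsandbytes dispatch code:

# Placeholders standing in for the real backend callables.
def _sycl(name):
    def fn(*args, **kwargs):
        return f"SYCL {name}"
    return fn

def _triton(name):
    def fn(*args, **kwargs):
        return f"Triton {name}"
    return fn

sycl_ops   = {n: _sycl(n)   for n in ("dequantize_blockwise", "dequantize_4bit", "gemv_4bit")}
triton_ops = {n: _triton(n) for n in ("quantize_blockwise", "quantize_4bit",
                                      "dequantize_blockwise", "dequantize_4bit", "gemv_4bit")}

def make_dispatch(sycl_ops, triton_ops):
    # Start with everything Triton provides, then override with SYCL where a kernel exists.
    table = dict(triton_ops)
    table.update(sycl_ops)
    return table

ops = make_dispatch(sycl_ops, triton_ops)
assert ops["quantize_4bit"]() == "Triton quantize_4bit"   # not ported to SYCL yet: Triton fallback
assert ops["gemv_4bit"]() == "SYCL gemv_4bit"             # SYCL kernel preferred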

@xiaolil1 (Author) replied:

These two kernels don't affect QLoRA performance; they currently run with PyTorch ops by default, and we will implement them with SYCL kernels later.

@Egor-Krivov
Contributor

The implementation is missing the following methods:

void cgemm_4bit_inference_naive_fp16(
    int m, int n, int k, half* A, unsigned char* B, float* absmax, float* datatype, half* out, int lda, int ldb,
    int ldc, int blocksize, cudaStream_t stream
) {
    gemm_4bit_inference_naive_fp16(m, n, k, A, B, absmax, datatype, out, lda, ldb, ldc, blocksize, stream);
}

void cgemm_4bit_inference_naive_bf16(
    int m, int n, int k, __nv_bfloat16* A, unsigned char* B, float* absmax, float* datatype, __nv_bfloat16* out,
    int lda, int ldb, int ldc, int blocksize, cudaStream_t stream
) {
    gemm_4bit_inference_naive_bf16(m, n, k, A, B, absmax, datatype, out, lda, ldb, ldc, blocksize, stream);
}

void cgemm_4bit_inference_naive_fp32(
    int m, int n, int k, float* A, unsigned char* B, float* absmax, float* datatype, float* out, int lda, int ldb,
    int ldc, int blocksize, cudaStream_t stream
) {
    gemm_4bit_inference_naive_fp32(m, n, k, A, B, absmax, datatype, out, lda, ldb, ldc, blocksize, stream);
}

cgemm_4bit_inference_naive_bf16 is used for text generation in the basic unsloth tutorial. Here is a stack trace:

Traceback (most recent call last):
  File "/home/jovyan/triton/unsloth/bench_unsloth.py", line 150, in <module>
    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/peft/peft_model.py", line 1968, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/models/llama.py", line 1821, in unsloth_fast_generate
    output = self._old_generate(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/transformers/generation/utils.py", line 2625, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/transformers/generation/utils.py", line 3609, in _sample
    outputs = model_forward(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/.conda/envs/unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/models/llama.py", line 1253, in _CausalLM_fast_forward
    outputs = fast_forward_inference(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/models/llama.py", line 1186, in LlamaModel_fast_forward_inference_custom
    X, present_key_value = attention_fast_forward_inference(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/models/llama.py", line 264, in LlamaAttention_fast_forward_inference
    Qn = fast_linear_forward(self.q_proj, Xn, out = self.temp_QA[0])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/kernels/utils.py", line 637, in fast_linear_forward
    out = fast_gemv(X, W, W_quant, out = out)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/triton/unsloth/unsloth/kernels/utils.py", line 483, in fast_gemv
    fx(m, n, k, get_ptr(X), get_ptr(W), get_ptr(absmax), get_ptr(stats), get_ptr(out),
  File "/home/jovyan/triton/bitsandbytes/bitsandbytes/cextension.py", line 60, in throw_on_call
    raise RuntimeError(
RuntimeError: Method 'cgemm_4bit_inference_naive_bf16' not available in CPU-only version of bitsandbytes.
Reinstall with GPU support or use CUDA-enabled hardware.

@jiqing-feng
Contributor

Hi @Egor-Krivov . Could you share your script to get this error?

@xiaolil1
Author

xiaolil1 commented Jul 8, 2025

The implementation is missing following methods: cgemm_4bit_inference_naive_fp16 / cgemm_4bit_inference_naive_bf16 / cgemm_4bit_inference_naive_fp32 [...] cgemm_4bit_inference_naive_bf16 is used for text generation in the basic unsloth tutorial. [...]

@Egor-Krivov, these kernels are already implemented with SYCL.
For "gemm_4bit_inference", you need to call the "cgemv_4bit_inference_*" entry points.
You can refer to the kernel dispatch in "csrc/pythonInterface.cpp":
void cgemv_4bit_inference_fp16(
    int m, int n, int k, sycl::half* A, unsigned char* B, float* absmax, float* datatype, sycl::half* out,
    int lda, int ldb, int ldc, int blocksize, sycl::queue* stream
) {
    gemv_4bit_inference_fp16(m, n, k, A, B, absmax, datatype, out, lda, ldb, ldc, blocksize, stream);
}
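
For illustration, a hedged sketch (not from this PR) of how a caller such as unsloth's fast_gemv could select the XPU entry point instead of the CUDA one; the pick_gemv_4bit_fn helper is hypothetical, and the exact exported symbol names should be verified against csrc/pythonInterface.cpp:

import torch
from bitsandbytes.cextension import lib  # native library handle, as in the stack trace above

def pick_gemv_4bit_fn(dtype: torch.dtype, device_type: str):
    # Hypothetical helper: map dtype + device type to the native 4-bit gemv entry point.
    # XPU uses the SYCL cgemv_4bit_inference_* symbols; CUDA uses cgemm_4bit_inference_naive_*.
    suffix = {torch.float16: "fp16", torch.bfloat16: "bf16", torch.float32: "fp32"}[dtype]
    if device_type == "xpu":
        return getattr(lib, f"cgemv_4bit_inference_{suffix}")
    return getattr(lib, f"cgemm_4bit_inference_naive_{suffix}")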

@jiqing-feng
Contributor

Hi @matthewdouglas . Could you please trigger the CI for this PR? Thanks!

@xiaolil1
Author

xiaolil1 commented Jul 8, 2025

This PR is ready for review now. Please reach out to us if there are any other questions, thanks!

@Egor-Krivov
Contributor

Hi @Egor-Krivov . Could you share your script to get this error?

I'm working on performance testing of unsloth right now.

These methods are used by the CUDA implementation here:
https://github.com/unslothai/unsloth/blob/6ac4e2e36f2f8bd0bc63a6eb85afa7097948ff3d/unsloth/kernels/utils.py#L173
For XPU we will need to provide an implementation as well, I think.

I am working with a POC branch (not merged to upstream): https://github.com/leizhenyuan/unsloth/blob/7bed913255f611e220c2d219ee988c179ed98033/unsloth/kernels/utils.py#L154
In the POC branch this method can be called.

For me the call happens in the last two lines of my script, which is essentially a copy of the unsloth tutorial:

from unsloth import FastLanguageModel
import torch
import time

device = 'xpu:0'

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # EGOR
    device_map={"": device}, # Use this to set the device for the model
    # attn_implementation="eager",
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

model = model.to(device)

# import pdb
# pdb.set_trace()
# model_devices = set(model.hf_device_map.values())
# model.hf_device_map = {0: 'xpu'}


from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")



from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)


from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq, TrainerCallback

# from bench_tools import tm
import pprint
from collections import defaultdict

class LatencyCallback(TrainerCallback):
    def __init__(self):
        self.step_start_time = None
    
    def on_step_begin(self, args, state, control, **kwargs):
        self.step_start_time = time.time()
    
    def on_step_end(self, args, state, control, **kwargs):
        if self.step_start_time is not None:
            step_latency = time.time() - self.step_start_time
            print(f"Step {state.global_step}: Latency = {step_latency:.4f} seconds")
            print()
        # self.times.append(get_time())
        # if len(self.times) > 1:
        #     print("Token latency: {:.1f} ms".format(1000 * (self.times[-1] - self.times[-2])))

        # if len(self.times) % 10 == 3 and self.print_median:
        #     ts = np.array(self.times)
        #     diff = ts[1:] - ts[:-1]
        #     # print("Token latency:", 1000 * diff, "ms")
        #     print("Token latency median:", np.median(1000 * diff), "ms")
        # print("Total accumulators:", {k: 1000* sum(v) for k, v in self.acc.items()}, "ms")
        # import pdb
        # pdb.set_trace()
        # print("Total accumulators:")# , {k: 1000* v for k, v in tm.get_results().items()}, "ms")
        # results = tm.get_results()
        # results = {k: f'{1000 * v:.2f}ms' for k, v in results.items()}
        # pprint.pprint(results)
        # tm.reset()

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks = [LatencyCallback()],
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        # optim = "adamw_8bit",
        optim = "adamw_torch",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]]))

trainer_stats = trainer.train()


from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to(device)

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)


github-actions bot commented Jul 8, 2025

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor

Hi @matthewdouglas. The lint test failed on the typo check; see this comment. Do you know how to skip the xpu kernels in the typo test?

* skip test for xpu ops
* fix lint
* skip typo for xpu
* skip
* skip

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Labels
Intel, Low Priority (will be worked on after all priority issues)

5 participants