CUDA error: an illegal memory access was encountered with Qwen 3.5 27b split across Nvidia GPUs

**Describe the Issue**
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
  cudaEventSynchronize((cudaEvent_t)event-&gt;context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)

This issue only seems to affect the Qwen 3.5 series (specifically the 27b model) when it is split across two Nvidia RTX 3090 GPUs. If I don't split the model, it doesn't crash, and if I split any of my other models, the software doesn't crash either.

Tested Multiple Versions
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-UD-Q8_K_XL.gguf (split crashes)
https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/blob/main/Qwen_Qwen3.5-27B-Q8_0.gguf (split crashes)
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-Q5_K_M.gguf (split crashes; unsplit works fine)

**Additional Information:**
koboldcpp-1.108.2
PROCESSOR QEMU Standard PC _Q35 + ICH9, 2009_ (Running in a VM with GPU Passthrough)
RAM 109.4 GiB
13th Gen Intel Core™ i9-13900K 25
NVIDIA Corporation GA102 [GeForce RTX 3090] / NVIDIA Corporation GA102 [GeForce RTX 3090]
Pop!_OS 22.04 LTS 64 Bit
Gnome 42.9 X11
nvidia-driver-580 (Original version was probably nvidia-driver-47x... I updated after this error and still had a similar outcome)

<pre>./koboldcpp-1.108.2 
***
Welcome to KoboldCpp - Version 1.108.2
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend (flag=0)

Loading Chat Completions Adapter: /tmp/_MEIUnX178/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 47
GPU layers is default: Will enable AutoFit for increased estimation accuracy.
System: Linux #202510191616~1762410050~22.04~898873a SMP PREEMPT_DYNAMIC Thu N x86_64 x86_64
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 97187 MB
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(admin=False, admindir=&apos;&apos;, adminpassword=&apos;&apos;, analyze=&apos;&apos;, autofit=True, autofitpadding=1024, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter=&apos;AutoGuess&apos;, cli=False, config=None, contextsize=16384, debugmode=0, defaultgenamt=896, device=&apos;&apos;, downloaddir=&apos;&apos;, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel=&apos;&apos;, enableguidance=False, exportconfig=&apos;&apos;, exporttemplate=&apos;&apos;, failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults=&apos;&apos;, gendefaultsoverwrite=False, genlimit=0, gpulayers=47, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey=&apos;&apos;, hordemaxctx=0, hordemodelname=&apos;&apos;, hordeworkername=&apos;&apos;, host=&apos;&apos;, ignoremissing=False, jinja=False, jinja_tools=False, launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mcpfile=None, mmproj=None, mmprojcpu=False, model=[], model_param=&apos;/home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf&apos;, moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, noflashattention=False, nommap=False, nomodel=False, nopipelineparallel=False, noshift=False, onready=&apos;&apos;, overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt=&apos;&apos;, quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1=&apos;&apos;, sdclip2=&apos;&apos;, sdclipgpu=False, sdconfig=None, sdconvdirect=&apos;off&apos;, sdflashattention=False, sdgendefaults=False, sdlora=None, sdloramult=1.0, sdmodel=&apos;&apos;, sdnotile=False, sdoffloadcpu=False, 
sdphotomaker=&apos;&apos;, sdquant=0, sdt5xxl=&apos;&apos;, sdthreads=11, sdtiledvae=768, sdupscaler=&apos;&apos;, sdvae=&apos;&apos;, sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=11, ttsgpu=False, ttsmaxlen=4096, ttsmodel=&apos;&apos;, ttsthreads=0, ttswavtokenizer=&apos;&apos;, unpack=&apos;&apos;, usecpu=False, usecuda=[&apos;normal&apos;, &apos;mmq&apos;], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel=&apos;&apos;)
==========
Loading Text Model: /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf

The reported GGUF Arch is: qwen35
Arch Category: 0

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they&apos;ll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
CUDA MMQ: True
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

Attempting to use llama.cpp&apos;s automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 16512 -ngl -1
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:05:00.0) - 23837 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:07:00.0) - 23845 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 18.06 GiB (5.77 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 248075 &apos;&lt;tts_text_bos_single&gt;&apos; is not marked as EOG
load: control token: 248073 &apos;&lt;tts_text_bos&gt;&apos; is not marked as EOG
load: control token: 248072 &apos;&lt;tts_pad&gt;&apos; is not marked as EOG
load: control token: 248071 &apos;&lt;|audio_end|&gt;&apos; is not marked as EOG
load: control token: 248061 &apos;&lt;|fim_middle|&gt;&apos; is not marked as EOG
load: control token: 248055 &apos;&lt;|vision_pad|&gt;&apos; is not marked as EOG
load: control token: 248052 &apos;&lt;|quad_end|&gt;&apos; is not marked as EOG
load: control token: 248049 &apos;&lt;|box_start|&gt;&apos; is not marked as EOG
load: control token: 248048 &apos;&lt;|object_ref_end|&gt;&apos; is not marked as EOG
load: control token: 248045 &apos;&lt;|im_start|&gt;&apos; is not marked as EOG
load: control token: 248057 &apos;&lt;|video_pad|&gt;&apos; is not marked as EOG
load: control token: 248070 &apos;&lt;|audio_start|&gt;&apos; is not marked as EOG
load: control token: 248056 &apos;&lt;|image_pad|&gt;&apos; is not marked as EOG
load: control token: 248054 &apos;&lt;|vision_end|&gt;&apos; is not marked as EOG
load: control token: 248060 &apos;&lt;|fim_prefix|&gt;&apos; is not marked as EOG
load: control token: 248050 &apos;&lt;|box_end|&gt;&apos; is not marked as EOG
load: control token: 248074 &apos;&lt;tts_text_eod&gt;&apos; is not marked as EOG
load: control token: 248053 &apos;&lt;|vision_start|&gt;&apos; is not marked as EOG
load: control token: 248062 &apos;&lt;|fim_suffix|&gt;&apos; is not marked as EOG
load: control token: 248047 &apos;&lt;|object_ref_start|&gt;&apos; is not marked as EOG
load: control token: 248051 &apos;&lt;|quad_start|&gt;&apos; is not marked as EOG
load: control token: 248076 &apos;&lt;|audio_pad|&gt;&apos; is not marked as EOG
load: setting token &apos;&lt;/think&gt;&apos; (248069) attribute to USER_DEFINED (16), old attributes: 16
load: setting token &apos;&lt;think&gt;&apos; (248068) attribute to USER_DEFINED (16), old attributes: 16
load: printing all EOG tokens:
load:   - 248044 (&apos;&lt;|endoftext|&gt;&apos;)
load:   - 248046 (&apos;&lt;|im_end|&gt;&apos;)
load:   - 248063 (&apos;&lt;|fim_pad|&gt;&apos;)
load:   - 248064 (&apos;&lt;|repo_name|&gt;&apos;)
load:   - 248065 (&apos;&lt;|file_sep|&gt;&apos;)
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 64
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = ?B
print_info: model params          = 26.90 B
print_info: general.name          = Qwen3.5 27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 &apos;,&apos;
print_info: EOS token             = 248046 &apos;&lt;|im_end|&gt;&apos;
print_info: EOT token             = 248046 &apos;&lt;|im_end|&gt;&apos;
print_info: PAD token             = 248044 &apos;&lt;|endoftext|&gt;&apos;
print_info: LF token              = 198 &apos;Ċ&apos;
print_info: FIM PRE token         = 248060 &apos;&lt;|fim_prefix|&gt;&apos;
print_info: FIM SUF token         = 248062 &apos;&lt;|fim_suffix|&gt;&apos;
print_info: FIM MID token         = 248061 &apos;&lt;|fim_middle|&gt;&apos;
print_info: FIM PAD token         = 248063 &apos;&lt;|fim_pad|&gt;&apos;
print_info: FIM REP token         = 248064 &apos;&lt;|repo_name|&gt;&apos;
print_info: FIM SEP token         = 248065 &apos;&lt;|file_sep|&gt;&apos;
print_info: EOG token             = 248044 &apos;&lt;|endoftext|&gt;&apos;
print_info: EOG token             = 248046 &apos;&lt;|im_end|&gt;&apos;
print_info: EOG token             = 248063 &apos;&lt;|fim_pad|&gt;&apos;
print_info: EOG token             = 248064 &apos;&lt;|repo_name|&gt;&apos;
print_info: EOG token             = 248065 &apos;&lt;|file_sep|&gt;&apos;
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: relocated tensors: 1 of 851
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:          CPU model buffer size =   833.59 MiB
load_tensors:        CUDA0 model buffer size =  8591.18 MiB
load_tensors:        CUDA1 model buffer size =  9065.65 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
...............................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.............................................

MRope is used, context shift will be disabled!
Automatic RoPE Scaling: Using model internal value.
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16640
llama_context: n_ctx_seq     = 16640
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (16640) &lt; n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:      CUDA0 KV buffer size =   520.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   520.00 MiB
llama_kv_cache: size = 1040.00 MiB ( 16640 cells,  16 layers,  1/1 seqs), K (f16):  520.00 MiB, V (f16):  520.00 MiB
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent: layer  27: skipped
llama_memory_recurrent: layer  31: skipped
llama_memory_recurrent: layer  35: skipped
llama_memory_recurrent: layer  39: skipped
llama_memory_recurrent: layer  43: skipped
llama_memory_recurrent: layer  47: skipped
llama_memory_recurrent: layer  51: skipped
llama_memory_recurrent: layer  55: skipped
llama_memory_recurrent: layer  59: skipped
llama_memory_recurrent: layer  63: skipped
llama_memory_recurrent:      CUDA0 RS buffer size =    77.93 MiB
llama_memory_recurrent:      CUDA1 RS buffer size =    71.70 MiB
llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: max_nodes = 27232
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
sched_reserve:      CUDA0 compute buffer size =   338.43 MiB
sched_reserve:      CUDA1 compute buffer size =   954.32 MiB
sched_reserve:  CUDA_Host compute buffer size =   150.08 MiB
sched_reserve: graph nodes  = 12729 (with bs=512), 4713 (with bs=1)
sched_reserve: graph splits = 6 (with bs=512), 5 (with bs=1)
sched_reserve: reserve took 62.12 ms, sched copies = 4
Threadpool set to 11 threads and 11 blasthreads...
attach_threadpool: call

This architecture has explicitly disabled the BOS token - if you need it, you must add it manually.
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: ChatML (Generic)
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001

Input: {&quot;n&quot;: 1, &quot;max_context_length&quot;: 16384, &quot;max_length&quot;: 4096, &quot;rep_pen&quot;: 1.05, &quot;temperature&quot;: 0.75, &quot;top_p&quot;: 0.92, &quot;top_k&quot;: 100, &quot;top_a&quot;: 0, &quot;typical&quot;: 1, &quot;tfs&quot;: 1, &quot;rep_pen_range&quot;: 360, &quot;rep_pen_slope&quot;: 0.7, &quot;sampler_order&quot;: [6, 0, 1, 3, 4, 2, 5], &quot;memory&quot;: &quot;&quot;, &quot;trim_stop&quot;: true, &quot;genkey&quot;: &quot;KCPP8731&quot;, &quot;min_p&quot;: 0, &quot;dynatemp_range&quot;: 0, &quot;dynatemp_exponent&quot;: 1, &quot;smoothing_factor&quot;: 0, &quot;smoothing_curve&quot;: 1, &quot;nsigma&quot;: 0, &quot;banned_tokens&quot;: [], &quot;render_special&quot;: false, &quot;logprobs&quot;: false, &quot;replace_instruct_placeholders&quot;: true, &quot;presence_penalty&quot;: 0, &quot;logit_bias&quot;: {}, &quot;adaptive_target&quot;: -1, &quot;adaptive_decay&quot;: 0.9, &quot;stop_sequence&quot;: [&quot;{{[INPUT]}}&quot;, &quot;{{[OUTPUT]}}&quot;], &quot;use_default_badwordsids&quot;: false, &quot;bypass_eos&quot;: false, &quot;prompt&quot;: &quot;{{[INPUT]}} INPUT REACTED {{[OUTPUT]}}&quot;}

Processing Prompt [BATCH] (512 / 1049 tokens)CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
  cudaEventSynchronize((cudaEvent_t)event-&gt;context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)
</pre>
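
For reference, the per-device buffer sizes reported in the log above can be added up to see how close each GPU is to its memory limit. This is only a rough sketch: the figures are copied by hand from the `load_tensors`, `llama_kv_cache`, `llama_memory_recurrent`, and `sched_reserve` lines, the "free" values come from the `llama_model_load_from_file_impl` lines, and the dictionary layout and the `total_mib` helper are my own naming, not anything from KoboldCpp.

```python
# Buffer sizes in MiB, copied by hand from the load log above.
buffers = {
    "CUDA0": {"model": 8591.18, "kv": 520.00, "rs": 77.93, "compute": 338.43},
    "CUDA1": {"model": 9065.65, "kv": 520.00, "rs": 71.70, "compute": 954.32},
}

# Free memory per device as reported at model load time, in MiB.
free_mib = {"CUDA0": 23837, "CUDA1": 23845}


def total_mib(parts):
    """Sum the reported buffers for one device, rounded to 2 decimals."""
    return round(sum(parts.values()), 2)


for dev, parts in buffers.items():
    used = total_mib(parts)
    share = used / free_mib[dev]
    print(f"{dev}: {used:.2f} MiB reserved of {free_mib[dev]} MiB free ({share:.0%})")
```

By this tally each card has well under half of its free memory reserved, so the crash does not look like a simple out-of-memory condition, although the log does not show allocations made after warm-up.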

CUDA error: an illegal memory access was encountered with Qwen 3.5 27b split across Nvidia GPUs #2005
