CUDA error: an illegal memory access was encountered with Qwen 3.5 27b split across Nvidia GPUs #2005

@aarongerber

Description

@aarongerber

Describe the Issue
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
cudaEventSynchronize((cudaEvent_t)event->context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)

This issue only seems to affect the Qwen 3.5 series (specifically the 27b model) when it is split across two NVIDIA RTX 3090 GPUs. If I don't split the model it doesn't crash, and none of my other models crash when split.

Tested Multiple Versions
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-UD-Q8_K_XL.gguf (split crashes)
https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/blob/main/Qwen_Qwen3.5-27B-Q8_0.gguf (split crashes)
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-Q5_K_M.gguf (split crashes, not split works fine)
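The exact launch commands aren't shown above (the Namespace dump below reflects the defaults), so here is a minimal sketch of the two configurations being compared, assuming the stock koboldcpp CLI flags; the `--tensor_split` ratio and `--gpulayers` value are illustrative, not taken from the report:

```shell
# Crashing configuration: model split across both 3090s
# (assumed flags; actual run used the GUI/defaults with both GPUs visible)
./koboldcpp-1.108.2 \
  --model models/Qwen_Qwen3.5-27B-Q5_K_M.gguf \
  --usecuda normal mmq \
  --gpulayers 47 \
  --tensor_split 1 1

# Working configuration: model confined to a single GPU
CUDA_VISIBLE_DEVICES=0 ./koboldcpp-1.108.2 \
  --model models/Qwen_Qwen3.5-27B-Q5_K_M.gguf \
  --usecuda normal mmq \
  --gpulayers 47
```

With `CUDA_VISIBLE_DEVICES=0` only one device is enumerated, so no cross-GPU split (and no `cudaEventSynchronize` across devices) occurs, which matches the "not split works fine" observation.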

Additional Information:
koboldcpp-1.108.2
PROCESSOR QEMU Standard PC Q35 + ICH9, 2009 (Running in a VM with GPU Passthrough)
RAM 109.4 GiB
13th Gen Intel Core™ i9-13900K × 25
NVIDIA Corporation GA102 [GeForce RTX 3090] / NVIDIA Corporation GA102 [GeForce RTX 3090]
Pop!_OS 22.04 LTS 64 Bit
Gnome 42.9 X11
nvidia-driver-580 (originally nvidia-driver-47x; I updated the driver after first seeing this error and the crash still occurred)

./koboldcpp-1.108.2 
***
Welcome to KoboldCpp - Version 1.108.2
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend (flag=0)

Loading Chat Completions Adapter: /tmp/_MEIUnX178/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 47
GPU layers is default: Will enable AutoFit for increased estimation accuracy.
System: Linux #202510191616~1762410050~22.04~898873a SMP PREEMPT_DYNAMIC Thu N x86_64 x86_64
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 97187 MB
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', autofit=True, autofitpadding=1024, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=16384, debugmode=0, defaultgenamt=896, device='', downloaddir='', draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=47, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=False, jinja_tools=False, launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mcpfile=None, mmproj=None, mmprojcpu=False, model=[], model_param='/home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, noflashattention=False, nommap=False, nomodel=False, nopipelineparallel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora=None, sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=11, sdtiledvae=768, sdupscaler='', sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, 
ssl=None, tensor_split=None, testmemory=False, threads=11, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', usecpu=False, usecuda=['normal', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf

The reported GGUF Arch is: qwen35
Arch Category: 0

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
CUDA MMQ: True
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 16512 -ngl -1
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:05:00.0) - 23837 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:07:00.0) - 23845 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 18.06 GiB (5.77 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 248075 '<tts_text_bos_single>' is not marked as EOG
load: control token: 248073 '<tts_text_bos>' is not marked as EOG
load: control token: 248072 '<tts_pad>' is not marked as EOG
load: control token: 248071 '<|audio_end|>' is not marked as EOG
load: control token: 248061 '<|fim_middle|>' is not marked as EOG
load: control token: 248055 '<|vision_pad|>' is not marked as EOG
load: control token: 248052 '<|quad_end|>' is not marked as EOG
load: control token: 248049 '<|box_start|>' is not marked as EOG
load: control token: 248048 '<|object_ref_end|>' is not marked as EOG
load: control token: 248045 '<|im_start|>' is not marked as EOG
load: control token: 248057 '<|video_pad|>' is not marked as EOG
load: control token: 248070 '<|audio_start|>' is not marked as EOG
load: control token: 248056 '<|image_pad|>' is not marked as EOG
load: control token: 248054 '<|vision_end|>' is not marked as EOG
load: control token: 248060 '<|fim_prefix|>' is not marked as EOG
load: control token: 248050 '<|box_end|>' is not marked as EOG
load: control token: 248074 '<tts_text_eod>' is not marked as EOG
load: control token: 248053 '<|vision_start|>' is not marked as EOG
load: control token: 248062 '<|fim_suffix|>' is not marked as EOG
load: control token: 248047 '<|object_ref_start|>' is not marked as EOG
load: control token: 248051 '<|quad_start|>' is not marked as EOG
load: control token: 248076 '<|audio_pad|>' is not marked as EOG
load: setting token '</think>' (248069) attribute to USER_DEFINED (16), old attributes: 16
load: setting token '<think>' (248068) attribute to USER_DEFINED (16), old attributes: 16
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 64
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = ?B
print_info: model params          = 26.90 B
print_info: general.name          = Qwen3.5 27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: relocated tensors: 1 of 851
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:          CPU model buffer size =   833.59 MiB
load_tensors:        CUDA0 model buffer size =  8591.18 MiB
load_tensors:        CUDA1 model buffer size =  9065.65 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
...............................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.............................................

MRope is used, context shift will be disabled!
Automatic RoPE Scaling: Using model internal value.
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16640
llama_context: n_ctx_seq     = 16640
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (16640) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:      CUDA0 KV buffer size =   520.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   520.00 MiB
llama_kv_cache: size = 1040.00 MiB ( 16640 cells,  16 layers,  1/1 seqs), K (f16):  520.00 MiB, V (f16):  520.00 MiB
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent: layer  27: skipped
llama_memory_recurrent: layer  31: skipped
llama_memory_recurrent: layer  35: skipped
llama_memory_recurrent: layer  39: skipped
llama_memory_recurrent: layer  43: skipped
llama_memory_recurrent: layer  47: skipped
llama_memory_recurrent: layer  51: skipped
llama_memory_recurrent: layer  55: skipped
llama_memory_recurrent: layer  59: skipped
llama_memory_recurrent: layer  63: skipped
llama_memory_recurrent:      CUDA0 RS buffer size =    77.93 MiB
llama_memory_recurrent:      CUDA1 RS buffer size =    71.70 MiB
llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: max_nodes = 27232
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
sched_reserve:      CUDA0 compute buffer size =   338.43 MiB
sched_reserve:      CUDA1 compute buffer size =   954.32 MiB
sched_reserve:  CUDA_Host compute buffer size =   150.08 MiB
sched_reserve: graph nodes  = 12729 (with bs=512), 4713 (with bs=1)
sched_reserve: graph splits = 6 (with bs=512), 5 (with bs=1)
sched_reserve: reserve took 62.12 ms, sched copies = 4
Threadpool set to 11 threads and 11 blasthreads...
attach_threadpool: call

This architecture has explicitly disabled the BOS token - if you need it, you must add it manually.
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: ChatML (Generic)
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 16384, "max_length": 4096, "rep_pen": 1.05, "temperature": 0.75, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "trim_stop": true, "genkey": "KCPP8731", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "smoothing_curve": 1, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "replace_instruct_placeholders": true, "presence_penalty": 0, "logit_bias": {}, "adaptive_target": -1, "adaptive_decay": 0.9, "stop_sequence": ["{{[INPUT]}}", "{{[OUTPUT]}}"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "{{[INPUT]}} INPUT REDACTED {{[OUTPUT]}}"}

Processing Prompt [BATCH] (512 / 1049 tokens)CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
  cudaEventSynchronize((cudaEvent_t)event->context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)
