Description
Describe the Issue
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
cudaEventSynchronize((cudaEvent_t)event->context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)
This issue only seems to affect the Qwen 3.5 series (specifically the 27B model) when it is split across two NVIDIA RTX 3090 GPUs. If I don't split the model, it doesn't crash, and splitting any of my other models doesn't crash either.
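A possible way to narrow this down, sketched under assumptions: CUDA kernel launches are asynchronous, so an illegal memory access is usually reported at a later synchronization point (here `cudaEventSynchronize`) rather than at the kernel that caused it. Setting the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` makes launches synchronous, which may move the reported location closer to the faulting call. The helper names `build_debug_env` and `build_command` below are hypothetical, the binary and model paths are the ones from this report, and whether the crash still reproduces with blocking launches is not known.

```python
import os


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 is a standard CUDA runtime setting; it slows
    # execution but makes errors surface at the launch that triggered them.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_command(binary, model_path):
    """Build the launcher command line (hypothetical helper for this sketch)."""
    return [binary, "--model", model_path]


env = build_debug_env()
cmd = build_command(
    "./koboldcpp-1.108.2",
    "/home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf",
)
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], " ".join(cmd))

# To actually launch the process with this environment:
#   import subprocess
#   subprocess.run(cmd, env=env, check=False)
```

If the error message then names a different function than `ggml_backend_cuda_device_event_synchronize`, that function is a better starting point for investigation than the synchronization call.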
Tested Multiple Versions
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-UD-Q8_K_XL.gguf (split crashes)
https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/blob/main/Qwen_Qwen3.5-27B-Q8_0.gguf (split crashes)
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/blob/main/Qwen3.5-27B-Q5_K_M.gguf (split crashes, not split works fine)
Additional Information:
koboldcpp-1.108.2
PROCESSOR QEMU Standard PC Q35 + ICH9, 2009 (Running in a VM with GPU Passthrough)
RAM 109.4 GiB
13th Gen Intel Core™ i9-13900K 25
NVIDIA Corporation GA102 [GeForce RTX 3090] / NVIDIA Corporation GA102 [GeForce RTX 3090]
Pop!_OS 22.04 LTS 64 Bit
Gnome 42.9 X11
nvidia-driver-580 (the original version was probably nvidia-driver-47x; I updated after first hitting this error and still had a similar outcome)
./koboldcpp-1.108.2
***
Welcome to KoboldCpp - Version 1.108.2
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend (flag=0)
Loading Chat Completions Adapter: /tmp/_MEIUnX178/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 47
GPU layers is default: Will enable AutoFit for increased estimation accuracy.
System: Linux #202510191616~1762410050~22.04~898873a SMP PREEMPT_DYNAMIC Thu N x86_64 x86_64
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 97187 MB
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', autofit=True, autofitpadding=1024, batchsize=512, benchmark=None, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=16384, debugmode=0, defaultgenamt=896, device='', downloaddir='', draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=47, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=False, jinja_tools=False, launch=True, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mcpfile=None, mmproj=None, mmprojcpu=False, model=[], model_param='/home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, noflashattention=False, nommap=False, nomodel=False, nopipelineparallel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora=None, sdloramult=1.0, sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=11, sdtiledvae=768, sdupscaler='', sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=False, smartcache=0, smartcontext=False, 
ssl=None, tensor_split=None, testmemory=False, threads=11, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', usecpu=False, usecuda=['normal', 'mmq'], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf
The reported GGUF Arch is: qwen35
Arch Category: 0
---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 16512 -ngl -1
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:05:00.0) - 23837 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:07:00.0) - 23845 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from /home/linux/Desktop/passthrough/llm/koboldcpp/models/Qwen_Qwen3.5-27B-Q5_K_M.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 18.06 GiB (5.77 BPW)
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 248075 '<tts_text_bos_single>' is not marked as EOG
load: control token: 248073 '<tts_text_bos>' is not marked as EOG
load: control token: 248072 '<tts_pad>' is not marked as EOG
load: control token: 248071 '<|audio_end|>' is not marked as EOG
load: control token: 248061 '<|fim_middle|>' is not marked as EOG
load: control token: 248055 '<|vision_pad|>' is not marked as EOG
load: control token: 248052 '<|quad_end|>' is not marked as EOG
load: control token: 248049 '<|box_start|>' is not marked as EOG
load: control token: 248048 '<|object_ref_end|>' is not marked as EOG
load: control token: 248045 '<|im_start|>' is not marked as EOG
load: control token: 248057 '<|video_pad|>' is not marked as EOG
load: control token: 248070 '<|audio_start|>' is not marked as EOG
load: control token: 248056 '<|image_pad|>' is not marked as EOG
load: control token: 248054 '<|vision_end|>' is not marked as EOG
load: control token: 248060 '<|fim_prefix|>' is not marked as EOG
load: control token: 248050 '<|box_end|>' is not marked as EOG
load: control token: 248074 '<tts_text_eod>' is not marked as EOG
load: control token: 248053 '<|vision_start|>' is not marked as EOG
load: control token: 248062 '<|fim_suffix|>' is not marked as EOG
load: control token: 248047 '<|object_ref_start|>' is not marked as EOG
load: control token: 248051 '<|quad_start|>' is not marked as EOG
load: control token: 248076 '<|audio_pad|>' is not marked as EOG
load: setting token '</think>' (248069) attribute to USER_DEFINED (16), old attributes: 16
load: setting token '<think>' (248068) attribute to USER_DEFINED (16), old attributes: 16
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch = qwen35
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 24
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 17408
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [11, 11, 10, 0]
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 6144
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 26.90 B
print_info: general.name = Qwen3.5 27B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248044 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: relocated tensors: 1 of 851
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: CPU model buffer size = 833.59 MiB
load_tensors: CUDA0 model buffer size = 8591.18 MiB
load_tensors: CUDA1 model buffer size = 9065.65 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
...............................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.............................................
MRope is used, context shift will be disabled!
Automatic RoPE Scaling: Using model internal value.
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16640
llama_context: n_ctx_seq = 16640
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (16640) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.95 MiB
llama_kv_cache: CUDA0 KV buffer size = 520.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 520.00 MiB
llama_kv_cache: size = 1040.00 MiB ( 16640 cells, 16 layers, 1/1 seqs), K (f16): 520.00 MiB, V (f16): 520.00 MiB
llama_memory_recurrent: layer 3: skipped
llama_memory_recurrent: layer 7: skipped
llama_memory_recurrent: layer 11: skipped
llama_memory_recurrent: layer 15: skipped
llama_memory_recurrent: layer 19: skipped
llama_memory_recurrent: layer 23: skipped
llama_memory_recurrent: layer 27: skipped
llama_memory_recurrent: layer 31: skipped
llama_memory_recurrent: layer 35: skipped
llama_memory_recurrent: layer 39: skipped
llama_memory_recurrent: layer 43: skipped
llama_memory_recurrent: layer 47: skipped
llama_memory_recurrent: layer 51: skipped
llama_memory_recurrent: layer 55: skipped
llama_memory_recurrent: layer 59: skipped
llama_memory_recurrent: layer 63: skipped
llama_memory_recurrent: CUDA0 RS buffer size = 77.93 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 71.70 MiB
llama_memory_recurrent: size = 149.62 MiB ( 1 cells, 64 layers, 1 seqs), R (f32): 5.62 MiB, S (f32): 144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: max_nodes = 27232
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
sched_reserve: CUDA0 compute buffer size = 338.43 MiB
sched_reserve: CUDA1 compute buffer size = 954.32 MiB
sched_reserve: CUDA_Host compute buffer size = 150.08 MiB
sched_reserve: graph nodes = 12729 (with bs=512), 4713 (with bs=1)
sched_reserve: graph splits = 6 (with bs=512), 5 (with bs=1)
sched_reserve: reserve took 62.12 ms, sched copies = 4
Threadpool set to 11 threads and 11 blasthreads...
attach_threadpool: call
This architecture has explicitly disabled the BOS token - if you need it, you must add it manually.
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat completion heuristic: ChatML (Generic)
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
======
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 16384, "max_length": 4096, "rep_pen": 1.05, "temperature": 0.75, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "trim_stop": true, "genkey": "KCPP8731", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "smoothing_curve": 1, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "replace_instruct_placeholders": true, "presence_penalty": 0, "logit_bias": {}, "adaptive_target": -1, "adaptive_decay": 0.9, "stop_sequence": ["{{[INPUT]}}", "{{[OUTPUT]}}"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "{{[INPUT]}} INPUT REACTED {{[OUTPUT]}}"}
Processing Prompt [BATCH] (512 / 1049 tokens)CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_device_event_synchronize at ggml/src/ggml-cuda/ggml-cuda.cu:4947
cudaEventSynchronize((cudaEvent_t)event->context)
ggml/src/ggml-cuda/ggml-cuda.cu:99: CUDA error
Aborted (core dumped)
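As a rough sanity check on whether the crash could be plain memory exhaustion, the per-device buffer sizes printed in the log above can be summed and compared with the free memory reported at load time. The sketch below is only a tally of those logged numbers: the function name `tally` and the regular expressions are assumptions made for this example, and the total ignores anything the log does not print (CUDA context overhead, allocator fragmentation, temporary buffers).

```python
import re
from collections import defaultdict

# Lines copied from the log in this report.
LOG_EXCERPT = """\
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:05:00.0) - 23837 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:07:00.0) - 23845 MiB free
load_tensors: CUDA0 model buffer size = 8591.18 MiB
load_tensors: CUDA1 model buffer size = 9065.65 MiB
llama_kv_cache: CUDA0 KV buffer size = 520.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 520.00 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 77.93 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 71.70 MiB
sched_reserve: CUDA0 compute buffer size = 338.43 MiB
sched_reserve: CUDA1 compute buffer size = 954.32 MiB
"""

# Matches per-GPU buffers only; "CUDA_Host" and "CPU" lines are skipped.
BUFFER_RE = re.compile(r"(CUDA\d+) (?:model|KV|RS|compute) buffer size = ([\d.]+) MiB")
FREE_RE = re.compile(r"using device (CUDA\d+) .* - (\d+) MiB free")


def tally(log_text):
    """Sum logged buffer sizes per device and collect reported free memory."""
    used = defaultdict(float)
    free = {}
    for line in log_text.splitlines():
        buf = BUFFER_RE.search(line)
        if buf:
            used[buf.group(1)] += float(buf.group(2))
        avail = FREE_RE.search(line)
        if avail:
            free[avail.group(1)] = float(avail.group(2))
    return dict(used), free


used, free = tally(LOG_EXCERPT)
for dev in sorted(used):
    print(f"{dev}: {used[dev]:.2f} MiB in logged buffers, {free[dev]:.0f} MiB free at load")
```

By this tally the logged buffers come to about 9.5 GiB on CUDA0 and about 10.4 GiB on CUDA1, well under the roughly 23.8 GiB reported free on each card. That suggests, but does not prove, that the illegal memory access is not a simple out-of-memory condition.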