CUDA error 700 : an illegal memory access was encountered #343
Comments
This looks like an error coming from llama.cpp itself, rather than LLamaSharp. Have you tried this model with llama.cpp directly to confirm whether you get the same error?
I have compiled llama.cpp with CUDA support and it works. I've tried it with a few different 7b models that work with llama.cpp but give this error with LlamaSharp. I've tried sending the same prompts to llama.cpp and that also works. And to make matters more confusing, it started working for a bit, then it started failing again.
Could you get a stack trace from the exception? That'll tell us what C# code was running when it crashed.
I'm getting this too, and it hard crashes the host app even if you have try/catch all over the place.
It should have fallen back automatically to CPU and swapped like crazy.
Right. But the actual issue here is that llama.cpp errors crash any LLamaSharp-based .NET application to the desktop. We can't handle these errors. And because they can't be handled, LLamaSharp can't fall back from GPU to CPU when memory errors occur, the way other systems such as LM Desktop do just fine. The result is a doubly brittle system that isn't deployable outside of very tightly controlled environments.
Unfortunately I don't think there's any way we can handle a GGML_ASSERT. It's defined to call abort().
According to MS's docs, the best way to work around abort() is to call into the C++ library from a separate process spawned by the C# host.
Yep, that would be the only way to handle an abort(). That's not something LLamaSharp does internally at the moment (and personally I would say we're unlikely to, remaining just a wrapper around llama.cpp). Instead, imo the two ways to handle this would be at a higher level (load LLamaSharp in a separate process and interact with it) and at a lower level (contact the llama.cpp team and ask them to use a recoverable kind of error detection where possible).
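For illustration, a minimal sketch of that process-isolation approach in C#. The worker executable name ("LlamaWorker"), its command-line arguments, and the fallback policy are all hypothetical and not part of LLamaSharp; the point is only that a native abort() kills the child process rather than the host application.

```csharp
// Minimal sketch of the "separate process" workaround discussed above.
// "LlamaWorker" is a hypothetical console program that hosts LLamaSharp,
// reads a prompt from stdin and writes the completion to stdout.
using System;
using System.Diagnostics;

static class LlamaHost
{
    // Runs one prompt through the worker process. If the native code aborts
    // (GGML_ASSERT -> abort()), only the worker dies: the parent sees a
    // non-zero exit code instead of crashing to the desktop.
    public static string? RunPrompt(string modelPath, string prompt, int gpuLayers)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "LlamaWorker",                   // hypothetical worker executable
            Arguments = $"\"{modelPath}\" {gpuLayers}", // hypothetical CLI: model path + layers to offload
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };

        using var worker = Process.Start(psi)!;
        worker.StandardInput.WriteLine(prompt);
        worker.StandardInput.Close();

        var output = worker.StandardOutput.ReadToEnd();
        worker.WaitForExit();

        return worker.ExitCode == 0 ? output : null;
    }

    // Caller-side fallback: try with GPU offload first, retry on CPU only
    // if the worker crashed (e.g. on a CUDA illegal memory access).
    public static string RunWithFallback(string modelPath, string prompt)
    {
        return RunPrompt(modelPath, prompt, gpuLayers: 35)
            ?? RunPrompt(modelPath, prompt, gpuLayers: 0)
            ?? throw new InvalidOperationException("Worker failed on both GPU and CPU.");
    }
}
```

The trade-off is the cost of reloading the model in the worker and marshalling prompts and results across the process boundary, but a process is the only isolation boundary that survives an abort().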
Would it not make sense for LLamaSharp as a project to request this? It would also benefit every other language consuming llama.cpp (and would help their own server).
I can ask if you'd prefer not to, but LLamaSharp doesn't have any special pull in the llama.cpp project. To be honest, at the moment I suspect any such request will be largely ignored (unless it's accompanied by PRs to implement better error handling).
Could you do so? This really is killing us because it doesn't allow us to fall back to not using the GPU when this occurs. |
I've opened up ggerganov/llama.cpp#4385 |
Although I will say I wouldn't expect this to change quickly, if at all! It would be a large change in both LLamaSharp and llama.cpp! If this is an issue you currently have, you'll want to split off your usage of LLamaSharp into a separate process.
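For completeness, a sketch of what the hypothetical "LlamaWorker" program from the earlier sketch might look like. The LLamaSharp type and member names used here (ModelParams, LLamaWeights, InteractiveExecutor, InferenceParams) are taken from recent releases and may differ in the version you target; treat this as an outline rather than a drop-in implementation.

```csharp
// Hypothetical "LlamaWorker": loads the model, answers one prompt read from
// stdin, then exits. Any GGML_ASSERT/abort() inside llama.cpp terminates only
// this process; the host detects the non-zero exit code and can retry on CPU.
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

class Program
{
    static async Task Main(string[] args)
    {
        // Hypothetical CLI: args[0] = model path, args[1] = GPU layers to offload.
        var modelPath = args[0];
        var gpuLayers = args.Length > 1 ? int.Parse(args[1]) : 0;

        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 4096,        // matches the n_ctx shown in the log below
            GpuLayerCount = gpuLayers, // 0 = pure CPU fallback
        };

        using var weights = LLamaWeights.LoadFromFile(parameters);
        using var context = weights.CreateContext(parameters);
        var executor = new InteractiveExecutor(context);

        // One prompt in on stdin, streamed completion out on stdout.
        var prompt = await Console.In.ReadToEndAsync();
        await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
        {
            Console.Write(token);
        }
    }
}
```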
Some interesting discussion related to error handling in llama.cpp here: ggerganov/ggml#701 |
RTX 3090
Windows 11
CUDA 12.3
Same result with WSL2 or Native.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
// snip all the tensor stuff
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '&lt;s&gt;'
llm_load_print_meta: EOS token = 2 '&lt;/s&gt;'
llm_load_print_meta: UNK token = 0 '&lt;unk&gt;'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 132.92 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 512.00 MiB
llama_new_context_with_model: kv self size = 512.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 8005.83 MiB (model: 7205.83 MiB, context: 800.00 MiB)
Entering chat...
Bot: Jade decided it was time to unwind after
CUDA error 700 at D:\a\LLamaSharp\LLamaSharp\ggml-cuda.cu:7576: an illegal memory access was encountered
current device: 0