slot update_slots: id 0 | task 0 | n_tokens = 157696, memory_seq_rm [157696, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 158720, batch.n_tokens = 1024, progress = 0.793560
srv update_slots: decode ubatch: 1024 tok, 1196.6ms (1.17ms/tok)
slot update_slots: id 0 | task 0 | n_tokens = 158720, memory_seq_rm [158720, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 159744, batch.n_tokens = 1024, progress = 0.798680
srv update_slots: decode ubatch: 1024 tok, 1203.2ms (1.17ms/tok)
slot update_slots: id 0 | task 0 | n_tokens = 159744, memory_seq_rm [159744, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 160768, batch.n_tokens = 1024, progress = 0.803800
srv update_slots: decode ubatch: 1024 tok, 1210.6ms (1.18ms/tok)
/data/ai/beellama.cpp/ggml/src/ggml-impl.h:318: fatal error
[New LWP 9970]
[New LWP 9969]
[New LWP 9968]
[New LWP 9967]
[New LWP 9966]
[New LWP 9965]
[New LWP 9964]
[New LWP 9963]
[New LWP 9962]
[New LWP 9961]
[New LWP 9960]
[New LWP 9959]
[New LWP 9958]
[New LWP 9957]
[New LWP 9956]
[New LWP 9955]
[New LWP 9954]
[New LWP 9953]
[New LWP 9952]
[New LWP 9951]
[New LWP 9950]
[New LWP 9949]
[New LWP 9948]
[New LWP 9947]
[New LWP 9946]
[New LWP 9945]
[New LWP 9944]
[New LWP 9943]
[New LWP 9942]
[New LWP 9941]
[New LWP 9940]
[New LWP 9936]
[New LWP 9931]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007e547e110813 in __GI___wait4 (pid=10532, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007e547e110813 in __GI___wait4 (pid=10532, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007e547ec2c503 in ggml_print_backtrace () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#2 0x00007e547ec2c6ab in ggml_abort () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#3 0x00007e547ec469e3 in ggml_backend_sched_set_tensor_backend () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#4 0x00007e547e8bb34f in std::_Function_handler<void (llama_ubatch const&, ggml_tensor*, char const*, int), llama_context::graph_get_cb() const::{lambda(llama_ubatch const&, ggml_tensor*, char const*, int)#1}>::_M_invoke(std::_Any_data const&, llama_ubatch const&, ggml_tensor*&&, char const*&&, int&&) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#5 0x00007e547e90835e in llm_graph_context::cb(ggml_tensor*, char const*, int) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#6 0x00007e547e9086c1 in llm_graph_context::build_norm(ggml_tensor*, ggml_tensor*, ggml_tensor*, llm_norm_type, int) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#7 0x00007e547ea3713a in llm_build_dflash_kv_update::llm_build_dflash_kv_update(llama_model const&, llm_graph_params const&) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#8 0x00007e547e960e58 in llama_model::build_graph(llm_graph_params const&) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#9 0x00007e547e8c6c07 in llama_context::dflash_kv_cache_update(int) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#10 0x00007e547e8c74e9 in llama_context::dflash_kv_cache_update_gpu(void const*, int, int, int, void (*)(void*, void const*, unsigned long, unsigned long)) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#11 0x00007e547f0de749 in common_speculative_state_dflash::update_drafter_kv_cache(int) () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#12 0x00007e547f0df0a7 in common_speculative_state_dflash::flush_prefill() () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#13 0x00007e547f0d6acd in common_speculative_flush_prefill(common_speculative*) () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#14 0x000062354a2a1e1b in server_context_impl::update_slots() ()
#15 0x000062354a340d2c in server_queue::start_loop(long) ()
#16 0x000062354a1fc617 in main ()
[Inferior 1 (process 9930) detached]
Name and Version
version: 9343 (233c99e)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel(R) Core(TM) i7-14700K + NVIDIA RTX 5090 32GB VRAM
Models
unsloth/Qwen3.6-27B-GGUF:Q5_K_S + spiritbuun/Qwen3.6-27B-DFlash-GGUF:Q4_K_M
Problem description & steps to reproduce
Setup:
--no-spec-dm-adaptive --spec-dflash-cross-ctx 1024 (also tried 512)
-ub 1024 -b 1024
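For reference, the full launch command was approximately the following (a sketch: model paths and the -c value are illustrative, and -md/--model-draft is the upstream llama.cpp draft-model flag, assumed unchanged in this build):
llama-server -m Qwen3.6-27B-Q5_K_S.gguf -md Qwen3.6-27B-DFlash-Q4_K_M.gguf -c 204800 -ub 1024 -b 1024 --no-spec-dm-adaptive --spec-dflash-cross-ctx 1024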
Symptom:
NIAH-style prefill of a ~200K-token prompt consistently aborts inside
update_drafter_kv_cache at progress 158720/200010. The checkpoint is
deterministic across runs, and 158720 = 155 × 1024, i.e. it lands exactly on a
ubatch boundary.
Condensed stack trace (innermost frame first; full GDB backtrace at the top of this report). The abort is GGML's fatal-error path, reached from ggml_backend_sched_set_tensor_backend while the freshly built DFlash KV-update graph is being scheduled:
ggml-impl.h:318: fatal error
ggml_backend_sched_set_tensor_backend
llm_build_dflash_kv_update
llama_context::dflash_kv_cache_update_gpu
common_speculative_state_dflash::update_drafter_kv_cache
common_speculative_state_dflash::flush_prefill
Tried: --spec-dflash-cross-ctx 512 instead of 1024; the abort reproduces at the same checkpoint.
Diagnosis: the failure appears localized to the GPU ring path in the DFlash long-context prefill.
Repro: send a ~200K-token prompt to /v1/chat/completions and let it prefill in -ub 1024 chunks (see the script below).
NIAH JSONL test set available on request.
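A minimal repro sketch (assumptions: server at localhost:8080 with the default OpenAI-compatible endpoint; the filler/needle text below is illustrative, not the actual NIAH JSONL set):

import json
import urllib.request

# ~10 tokens per repeat; 20000 repeats gives roughly 200K tokens of prefill
filler = "The quick brown fox jumps over the lazy dog. "
needle = "The magic number is 424242. "
prompt = filler * 10000 + needle + filler * 10000 + "\nWhat is the magic number?"

payload = {
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 32,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Prefill at ~1.2 s per 1024-token ubatch takes a few minutes, hence the long timeout
with urllib.request.urlopen(req, timeout=3600) as resp:
    print(resp.read().decode("utf-8"))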
First Bad Commit
No response
Relevant log output
The server log and full GDB stacktrace are pasted at the top of this report.