
Eval bug: GPU ring buffer crash in llm_build_dflash_kv_update at long context prefill (~158K tokens) #16

@ppsx

Description


Name and Version

version: 9343 (233c99e)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

Intel(R) Core(TM) i7-14700K + NVIDIA RTX 5090 32GB VRAM

Models

unsloth/Qwen3.6-27B-GGUF:Q5_K_S + spiritbuun/Qwen3.6-27B-DFlash-GGUF:Q4_K_M

Problem description & steps to reproduce

Setup:

  • BeeLlama v0.1.0 (commit b9343-233c99ed7), built with CUDA 13.2, running on an RTX 5090 (Blackwell)
  • Target: unsloth/Qwen3.6-27B-GGUF Q5_K_S
  • Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF Q4_K_M
  • Cache: --cache-type-k turbo4 --cache-type-v turbo3_tcq
  • Config: --ctx-size 262144 --kv-unified --spec-draft-n-max 8
    --no-spec-dm-adaptive --spec-dflash-cross-ctx 1024 (also tried 512)
    -ub 1024 -b 1024
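For reference, the flags above combine into a launch command along these lines. This is a sketch only: the server binary name, the local GGUF file paths, and the draft-model flag name are assumptions, not taken from this report.

```shell
# Hypothetical launch sketch; binary name, model paths, and the
# draft-model flag are placeholders. All other flags are from Setup above.
./beellama-server \
  -m Qwen3.6-27B-Q5_K_S.gguf \
  --model-draft Qwen3.6-27B-DFlash-Q4_K_M.gguf \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
  --ctx-size 262144 --kv-unified \
  --spec-draft-n-max 8 --no-spec-dm-adaptive \
  --spec-dflash-cross-ctx 1024 \
  -ub 1024 -b 1024
```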

Symptom:
NIAH-style prefill of a ~200K-token prompt consistently aborts inside
update_drafter_kv_cache at progress 158720/200010; the failure point is
deterministic across runs.

Abbreviated stack (full gdb backtrace in the log output below):
ggml-impl.h:318: fatal error
ggml_backend_sched_set_tensor_backend
llm_build_dflash_kv_update
llama_context::dflash_kv_cache_update_gpu
common_speculative_state_dflash::update_drafter_kv_cache
common_speculative_state_dflash::flush_prefill

Tried:

  • --spec-dflash-cross-ctx 512: crashes same way
  • 128K NIAH (same config): 40/40 passes
  • GGML_DFLASH_GPU_RING=0: WORKS at 200K (much slower, but completes)
  • DFlash disabled, only TCQ cache on Q5_K_S target: works at 200K
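The working fallback from the list above can be reproduced by disabling the GPU ring path via the environment variable before launch (a sketch; the binary name is a placeholder):

```shell
# Workaround noted above: disabling the DFlash GPU ring path lets the
# 200K prefill complete via the slower non-ring path.
export GGML_DFLASH_GPU_RING=0
# ...then launch the server with the same flags as in Setup.
```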

Diagnosis: the crash appears localized to the GPU ring-buffer path used by DFlash during long-context prefill; the non-ring path and the non-DFlash configuration both complete at 200K.

Repro: send a 200K-token prompt to /v1/chat/completions; the server prefills it in -ub 1024 chunks.
NIAH JSONL test set available on request.
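A minimal sketch of how such a NIAH-style repro prompt and request payload can be constructed. The filler sentence, needle string, model name, and the ~4 chars/token heuristic are assumptions for illustration, not taken from the actual test set:

```python
import json

# Placeholder filler and needle; the real NIAH JSONL set differs.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is 7f3a9c."

def build_prompt(target_chars: int = 800_000) -> str:
    # ~4 chars/token is a rough heuristic, so ~800K chars
    # approximates a 200K-token prompt.
    body = FILLER * (target_chars // len(FILLER))
    midpoint = len(body) // 2
    # Bury the needle in the middle of the haystack.
    return body[:midpoint] + NEEDLE + " " + body[midpoint:]

def build_payload(prompt: str) -> dict:
    return {
        "model": "qwen3.6-27b",  # placeholder model name
        "messages": [
            {"role": "user",
             "content": prompt + "\nWhat is the secret passphrase?"}
        ],
        "max_tokens": 32,
    }

payload = build_payload(build_prompt())
request_body = json.dumps(payload)
# POST request_body to http://localhost:8080/v1/chat/completions
# (host/port are placeholders for the local server).
```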

First Bad Commit

No response

Relevant log output

Stacktrace:

slot update_slots: id  0 | task 0 | n_tokens = 157696, memory_seq_rm [157696, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 158720, batch.n_tokens = 1024, progress = 0.793560
srv  update_slots:   decode ubatch: 1024 tok, 1196.6ms (1.17ms/tok)
slot update_slots: id  0 | task 0 | n_tokens = 158720, memory_seq_rm [158720, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 159744, batch.n_tokens = 1024, progress = 0.798680
srv  update_slots:   decode ubatch: 1024 tok, 1203.2ms (1.17ms/tok)
slot update_slots: id  0 | task 0 | n_tokens = 159744, memory_seq_rm [159744, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 160768, batch.n_tokens = 1024, progress = 0.803800
srv  update_slots:   decode ubatch: 1024 tok, 1210.6ms (1.18ms/tok)
/data/ai/beellama.cpp/ggml/src/ggml-impl.h:318: fatal error
[New LWP 9970]
[New LWP 9969]
[New LWP 9968]
[New LWP 9967]
[New LWP 9966]
[New LWP 9965]
[New LWP 9964]
[New LWP 9963]
[New LWP 9962]
[New LWP 9961]
[New LWP 9960]
[New LWP 9959]
[New LWP 9958]
[New LWP 9957]
[New LWP 9956]
[New LWP 9955]
[New LWP 9954]
[New LWP 9953]
[New LWP 9952]
[New LWP 9951]
[New LWP 9950]
[New LWP 9949]
[New LWP 9948]
[New LWP 9947]
[New LWP 9946]
[New LWP 9945]
[New LWP 9944]
[New LWP 9943]
[New LWP 9942]
[New LWP 9941]
[New LWP 9940]
[New LWP 9936]
[New LWP 9931]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007e547e110813 in __GI___wait4 (pid=10532, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x00007e547e110813 in __GI___wait4 (pid=10532, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007e547ec2c503 in ggml_print_backtrace () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#2  0x00007e547ec2c6ab in ggml_abort () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#3  0x00007e547ec469e3 in ggml_backend_sched_set_tensor_backend () from /data/ai/ai-beellama/bin/../lib/libggml-base.so.0
#4  0x00007e547e8bb34f in std::_Function_handler<void (llama_ubatch const&, ggml_tensor*, char const*, int), llama_context::graph_get_cb() const::{lambda(llama_ubatch const&, ggml_tensor*, char const*, int)#1}>::_M_invoke(std::_Any_data const&, llama_ubatch const&, ggml_tensor*&&, char const*&&, int&&) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#5  0x00007e547e90835e in llm_graph_context::cb(ggml_tensor*, char const*, int) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#6  0x00007e547e9086c1 in llm_graph_context::build_norm(ggml_tensor*, ggml_tensor*, ggml_tensor*, llm_norm_type, int) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#7  0x00007e547ea3713a in llm_build_dflash_kv_update::llm_build_dflash_kv_update(llama_model const&, llm_graph_params const&) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#8  0x00007e547e960e58 in llama_model::build_graph(llm_graph_params const&) const () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#9  0x00007e547e8c6c07 in llama_context::dflash_kv_cache_update(int) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#10 0x00007e547e8c74e9 in llama_context::dflash_kv_cache_update_gpu(void const*, int, int, int, void (*)(void*, void const*, unsigned long, unsigned long)) () from /data/ai/ai-beellama/bin/../lib/libllama.so.0
#11 0x00007e547f0de749 in common_speculative_state_dflash::update_drafter_kv_cache(int) () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#12 0x00007e547f0df0a7 in common_speculative_state_dflash::flush_prefill() () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#13 0x00007e547f0d6acd in common_speculative_flush_prefill(common_speculative*) () from /data/ai/ai-beellama/bin/../lib/libllama-common.so.0
#14 0x000062354a2a1e1b in server_context_impl::update_slots() ()
#15 0x000062354a340d2c in server_queue::start_loop(long) ()
#16 0x000062354a1fc617 in main ()
[Inferior 1 (process 9930) detached]
