DFlash segfault (exit 139) with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120) #33

@gdeyoung

Bug Report: DFlash Segfault with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120)

Environment

  • GPU: NVIDIA RTX PRO 4500 Blackwell (SM 12.0, 32 GB VRAM)
  • OS: Unraid 7.3, kernel 6.18.23
  • CUDA: 12.8.1 (nvidia/cuda:12.8.1-devel-ubuntu24.04)
  • BeeLlama build: main branch, cloned May 23, 2026
  • Docker: Yes, nvidia runtime

Models Used

  • Target: unsloth/Qwen3.6-27B-GGUF Q5_K_M (19 GB)
  • Draft: Anbeeld/Qwen3.6-27B-DFlash-GGUF Q4_K_M (986 MB)
  • mmproj: mmproj-BF16.gguf (889 MB, optional)

Command

llama-server \
  -m /models/Qwen3.6-27B-Q5_K_M.gguf \
  --spec-type dflash \
  --spec-draft-model /models/Qwen3.6-27B-DFlash-Q4_K_M.gguf \
  --spec-draft-ngl all \
  --spec-dflash-cross-ctx 1024 \
  --mmproj /models/mmproj-BF16.gguf \
  --reasoning on --jinja \
  -ngl all -c 524288 -np 1 -b 2048 -ub 512 \
  --kv-unified \
  --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
  --flash-attn on \
  --host 0.0.0.0 --port 11437

Behavior

  • Server starts successfully and passes health check
  • DFlash draft model loads correctly (58 tensors, 5 blocks, vocab match confirmed)
  • DFlash KV cache allocated (40 MB)
  • GPU hidden capture ring buffer allocated (5 layers x 1024 slots x 5120 embd, ~200 MB)
  • Crash occurs on first inference request with segfault (exit 139)
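
For readers unfamiliar with the exit status: shells and container runtimes conventionally report a process killed by signal N as 128 + N, so 139 corresponds to signal 11 (SIGSEGV on Linux). The following is a minimal sketch of that decoding; the helper name `decode_exit_status` is made up for illustration and is not part of any tool mentioned in this report.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Statuses above 128 conventionally mean "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(decode_exit_status(139))
```

On Linux this maps 139 to SIGSEGV, which matches the segfault reported above.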

Stack Trace

/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_state_dflash::draft(...)+0x361)
/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_draft(...)+0xbd)

Key Log Output Before Crash

dflash: target/drafter info: target_ctx_train=262144 target_vocab=248320 drafter_vocab=248320 vocab_match=1 capture_min=1 capture_max=61
dflash: GPU hidden capture policy: allowed=1 forced_cpu=0 requested=1 target_devices=1 drafter_devices=1
dflash gpu ring: allocated 5 layers x 1024 slots x 5120 embd + staging (~200 MB)
dflash: GPU cross ring enabled (5 layers x 1024 slots x 5120 embd)
dflash_kv_cache_init: allocated DFlash drafter K/V cache: 40.0 MB (5 layers, 1024 tokens, 1024 elems/token)
dflash: drafter K/V projection cache enabled (1024-token window)
slot launch_slot_: id 0 | spec dm controller: adaptive=1 controller=profit
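
As a sanity check on the allocation sizes in the log, the arithmetic can be sketched as below. The dimensions are taken from the log lines; the 4-byte (f32) element size and the assumption that K and V are each stored at 1024 elems/token are guesses, not something the log states.

```python
# Dimensions copied from the log output above.
LAYERS = 5
SLOTS = 1024               # ring slots / K/V window in tokens
N_EMBD = 5120              # hidden size captured per slot
KV_ELEMS_PER_TOKEN = 1024  # "1024 elems/token" from dflash_kv_cache_init

# ASSUMPTION: elements are stored as 32-bit floats.
BYTES_PER_ELEM = 4

MIB = 1024 * 1024

# Hidden capture ring, excluding whatever the staging buffer adds.
ring_mib = LAYERS * SLOTS * N_EMBD * BYTES_PER_ELEM / MIB

# ASSUMPTION: K and V are separate buffers of the same shape, hence the factor 2.
kv_mib = 2 * LAYERS * SLOTS * KV_ELEMS_PER_TOKEN * BYTES_PER_ELEM / MIB

if __name__ == "__main__":
    print(f"ring (no staging): {ring_mib:.1f} MiB")
    print(f"drafter K/V cache: {kv_mib:.1f} MiB")
```

Under these assumptions the K/V figure comes out at 40 MiB, in line with the logged 40.0 MB, and the ring alone at 100 MiB, which would leave about the same amount again for staging to reach the logged ~200 MB. The reported sizes therefore look internally consistent, which suggests the allocations themselves are not obviously wrong.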

Draft Model Metadata

general.architecture = dflash-draft
dflash-draft.block_count = 5
dflash-draft.context_length = 262144
dflash-draft.embedding_length = 5120
dflash-draft.dflash.block_size = 16
dflash-draft.dflash.target_layer_ids = [1, 16, 31, 46, 61]

Attempts Made (All Crash Identically)

  1. -np 1 (single slot) — same crash
  2. -np 2 (two slots) — same crash
  3. Without --mmproj — same crash
  4. Different cache types (q5_0/q4_1, turbo3_tcq) — same crash
  5. Different batch sizes (-b 2048 -ub 512 vs -b 4096 -ub 1024) — same crash
  6. Rebuilt with explicit CUDA arch -DGGML_CUDA_ARCH=120 — same crash (note: cmake ignored the flag, all 237 cubins compiled as sm_52)
  7. Both with and without --spec-dflash-cross-ctx 1024 — same crash

Additional Notes

  • 100% reproducible — every inference request triggers the crash
  • Standard llama.cpp operations (model loading, KV cache, attention, vision) all work correctly on this GPU with the same binary
  • The atomic-tq-mtp-cuda Docker image runs the same model at 36.4 tok/s without DFlash
  • MTP speculative decoding works (57.8 tok/s at 128K context) on this hardware
  • Build shows CUDA : ARCHS = 520 even with -DGGML_CUDA_ARCH=120 — cmake flag appears ignored
  • However, the sm_52 build still runs on Blackwell because the driver JIT-compiles the embedded PTX (CUDA forward compatibility); standard operations confirm this

Hypothesis

Qwen3.6-27B uses a hybrid architecture (16 KV-attention layers + 48 Gated DeltaNet/SSM layers). The DFlash draft model references target_layer_ids = [1, 16, 31, 46, 61] which span both layer types. The crash in common_speculative_state_dflash::draft() may be related to how DFlash handles hidden states from the non-standard SSM/DeltaNet layers during the draft function.
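
To make the "span both layer types" claim concrete, here is a rough sketch that classifies the target layer ids. It assumes 64 layers in total (16 + 48), that every fourth layer is a full-attention layer, and that the ids are 0-indexed. None of this has been verified against the actual Qwen3.6-27B config, so treat the layout rule as a placeholder to be replaced with the real one.

```python
# From the draft model metadata above.
TARGET_LAYER_IDS = [1, 16, 31, 46, 61]

# From the hypothesis: 16 attention + 48 Gated DeltaNet layers.
N_LAYERS = 64

# ASSUMPTION: one full-attention layer every 4 layers, 0-indexed ids,
# so layers 3, 7, 11, ... are attention. Not verified against the model config.
FULL_ATTN_INTERVAL = 4


def layer_kind(layer_id: int, interval: int = FULL_ATTN_INTERVAL) -> str:
    """Return 'attention' or 'deltanet' under the assumed layout."""
    return "attention" if (layer_id + 1) % interval == 0 else "deltanet"


# Map each captured target layer to its assumed type.
kinds = {i: layer_kind(i) for i in TARGET_LAYER_IDS}

# Consistency check of the assumed layout against the 16/48 split.
n_attention = sum(layer_kind(i) == "attention" for i in range(N_LAYERS))

if __name__ == "__main__":
    for layer_id, kind in kinds.items():
        print(f"layer {layer_id:2d}: {kind}")
    print(f"attention layers under this layout: {n_attention}")
```

If that layout is right, only layer 31 would be an attention layer and the other four captured layers would be DeltaNet layers, so most of the hidden states fed to the drafter would come from the non-standard layer type.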

The z-lab DFlash model page notes: "The model is still under training, and inference engine support may not be fully available yet due to architectural changes, including causal SWA layers."

Thank you for this excellent fork — looking forward to getting DFlash working on Qwen3.6!
