DFlash segfault (exit 139) with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120) #33

## Bug Report: DFlash Segfault with Qwen3.6-27B on RTX PRO 4500 Blackwell (SM120)

### Environment
- **GPU:** NVIDIA RTX PRO 4500 Blackwell (SM 12.0, 32 GB VRAM)
- **OS:** Unraid 7.3, kernel 6.18.23
- **CUDA:** 12.8.1 (nvidia/cuda:12.8.1-devel-ubuntu24.04)
- **BeeLlama build:** main branch, cloned May 23, 2026
- **Docker:** Yes, nvidia runtime

### Models Used
- **Target:** unsloth/Qwen3.6-27B-GGUF Q5_K_M (19 GB)
- **Draft:** Anbeeld/Qwen3.6-27B-DFlash-GGUF Q4_K_M (986 MB)
- **mmproj:** mmproj-BF16.gguf (889 MB, optional)

### Command
```bash
llama-server \
  -m /models/Qwen3.6-27B-Q5_K_M.gguf \
  --spec-type dflash \
  --spec-draft-model /models/Qwen3.6-27B-DFlash-Q4_K_M.gguf \
  --spec-draft-ngl all \
  --spec-dflash-cross-ctx 1024 \
  --mmproj /models/mmproj-BF16.gguf \
  --reasoning on --jinja \
  -ngl all -c 524288 -np 1 -b 2048 -ub 512 \
  --kv-unified \
  --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
  --flash-attn on \
  --host 0.0.0.0 --port 11437
```

### Behavior
- Server **starts successfully** and passes health check
- DFlash draft model loads correctly (58 tensors, 5 blocks, vocab match confirmed)
- DFlash KV cache allocated (40 MB)
- GPU hidden capture ring buffer allocated (5 layers x 1024 slots x 5120 embd, ~200 MB)
- **Crash occurs on first inference request** with segfault (exit 139)

### Stack Trace
```
/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_state_dflash::draft(...)+0x361)
/tmp/beellama.cpp/build/bin/libllama-common.so.0(common_speculative_draft(...)+0xbd)
```

### Key Log Output Before Crash
```
dflash: target/drafter info: target_ctx_train=262144 target_vocab=248320 drafter_vocab=248320 vocab_match=1 capture_min=1 capture_max=61
dflash: GPU hidden capture policy: allowed=1 forced_cpu=0 requested=1 target_devices=1 drafter_devices=1
dflash gpu ring: allocated 5 layers x 1024 slots x 5120 embd + staging (~200 MB)
dflash: GPU cross ring enabled (5 layers x 1024 slots x 5120 embd)
dflash_kv_cache_init: allocated DFlash drafter K/V cache: 40.0 MB (5 layers, 1024 tokens, 1024 elems/token)
dflash: drafter K/V projection cache enabled (1024-token window)
slot launch_slot_: id 0 | spec dm controller: adaptive=1 controller=profit
```
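The buffer sizes in the log can be sanity-checked with integer arithmetic. The sketch below assumes 4-byte elements and separate K and V copies; neither is stated in the log, but together they reproduce the 40.0 MB K/V figure exactly. Under the same element-size assumption the hidden-state ring alone comes to 100 MiB, so the reported ~200 MB would imply a staging area of roughly equal size (also an assumption). The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Sanity-check the buffer sizes reported in the DFlash log.
# ASSUMPTION (not stated in the log): every element is a 4-byte float.

# bytes = layers * slots * elems_per_slot * bytes_per_elem * copies
buffer_bytes() {
  local layers=$1 slots=$2 elems=$3 elem_bytes=$4 copies=$5
  echo $(( layers * slots * elems * elem_bytes * copies ))
}

# Integer conversion from bytes to MiB.
to_mib() { echo $(( $1 / 1024 / 1024 )); }

kv=$(buffer_bytes 5 1024 1024 4 2)    # K and V stored separately: 2 copies
ring=$(buffer_bytes 5 1024 5120 4 1)  # hidden-state ring only, no staging

echo "drafter K/V cache:   $(to_mib "$kv") MiB"
echo "hidden capture ring: $(to_mib "$ring") MiB (before staging)"
```

If the real element type differs (for example 2-byte halves), the numbers scale accordingly, so a mismatch with the log would itself be a useful hint about the layout.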

### Draft Model Metadata
```
general.architecture = dflash-draft
dflash-draft.block_count = 5
dflash-draft.context_length = 262144
dflash-draft.embedding_length = 5120
dflash-draft.dflash.block_size = 16
dflash-draft.dflash.target_layer_ids = [1, 16, 31, 46, 61]
```

### Attempts Made (All Crash Identically)
1. `-np 1` (single slot) — same crash
2. `-np 2` (two slots) — same crash
3. Without `--mmproj` — same crash
4. Different cache types (`q5_0/q4_1`, `turbo3_tcq`) — same crash
5. Different batch sizes (`-b 2048 -ub 512` vs `-b 4096 -ub 1024`) — same crash
6. Rebuilt with explicit CUDA arch `-DGGML_CUDA_ARCH=120` — same crash (note: cmake ignored the flag, all 237 cubins compiled as sm_52)
7. Both with and without `--spec-dflash-cross-ctx 1024` — same crash

### Additional Notes
- **100% reproducible** — every inference request triggers the crash
- Standard llama.cpp operations (model loading, KV cache, attention, vision) all work correctly on this GPU with the same binary
- The `atomic-tq-mtp-cuda` Docker image runs the same model at 36.4 tok/s without DFlash
- MTP speculative decoding works (57.8 tok/s at 128K context) on this hardware
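- The build reports `CUDA : ARCHS = 520` even with `-DGGML_CUDA_ARCH=120`, so CMake appears to ignore the flag
- sm_52 cubins cannot execute on Blackwell directly, but if compute_52 PTX is embedded alongside them the driver JIT-compiles it for the newer GPU; the fact that standard operations work suggests this is what happens here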
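One way to confirm which GPU targets actually ended up in the build is to list the embedded cubin and PTX entries with `cuobjdump`. The sketch below is a hedged example: the library path is an assumption based on the build directory in the stack trace, and `list_archs` is a hypothetical helper. `CMAKE_CUDA_ARCHITECTURES` is the standard CMake variable for selecting CUDA targets; whether this fork honors it the way upstream llama.cpp does is also an assumption.

```bash
#!/usr/bin/env bash
# Extract the unique sm_XX targets from `cuobjdump --list-elf` or
# `cuobjdump --list-ptx` style output read on stdin.
list_archs() { grep -o 'sm_[0-9]*' | sort -u; }

# ASSUMED library path; adjust to wherever the CUDA backend was built.
LIB=${LIB:-/tmp/beellama.cpp/build/bin/libggml-cuda.so}

# Only query the library when the tool and the file are both present.
if command -v cuobjdump >/dev/null 2>&1 && [ -f "$LIB" ]; then
  echo "cubin targets: $(cuobjdump --list-elf "$LIB" | list_archs | tr '\n' ' ')"
  echo "PTX targets:   $(cuobjdump --list-ptx "$LIB" | list_archs | tr '\n' ' ')"
fi

# Reconfigure with the standard CMake variable (assumes the fork honors it):
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
#   cmake --build build -j
```

If the PTX list is empty and only sm_52 cubins are present, the binary should not run on SM 12.0 at all, which would contradict the working standard operations and point to a different library being loaded.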

### Hypothesis
Qwen3.6-27B uses a hybrid architecture (16 KV-attention layers + 48 Gated DeltaNet/SSM layers). The DFlash draft model references `target_layer_ids = [1, 16, 31, 46, 61]`, which appear to span both layer types. The crash in `common_speculative_state_dflash::draft()` may stem from how DFlash handles hidden states captured from the Gated DeltaNet/SSM layers while drafting.
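To illustrate the layer-type mix, the sketch below classifies the five target layers under an assumed layout: 64 layers with every 4th one (0-based indices 3, 7, 11, ...) being full attention. That interval is a guess derived only from the 16/48 split above, the indexing convention of `target_layer_ids` is also assumed to be 0-based, and `layer_kind` is a hypothetical helper. None of this is read from the model file.

```bash
#!/usr/bin/env bash
# Classify target layers under an ASSUMED layout: every 4th layer
# (0-based indices 3, 7, 11, ...) is full attention, the rest are
# Gated DeltaNet. This is a guess, not metadata from the GGUF.
ATTN_INTERVAL=4   # hypothetical: 64 layers / 16 attention layers

layer_kind() {
  local idx=$1
  if (( (idx + 1) % ATTN_INTERVAL == 0 )); then
    echo attention
  else
    echo deltanet
  fi
}

# Layer ids taken from dflash-draft.dflash.target_layer_ids.
for id in 1 16 31 46 61; do
  printf 'layer %2d: %s\n' "$id" "$(layer_kind "$id")"
done
```

Under these assumptions only layer 31 would be an attention layer and the other four would be DeltaNet layers, so most captured hidden states would come from the non-attention path.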

The z-lab DFlash model page notes: "The model is still under training, and inference engine support may not be fully available yet due to architectural changes, including causal SWA layers."

Thank you for this excellent fork — looking forward to getting DFlash working on Qwen3.6!
