Summary
llama-server with --spec-type dflash crashes immediately on the first multi-slot decode whenever -np > 1. Reproduces on v0.1.2 (prebuilt Win-CUDA-13.1) and on the current main (ba1fdce5, built from source) on NVIDIA RTX PRO 6000 Blackwell (sm_120, compute cap 12.0, driver 596.36, Windows 11, CUDA 13.2 toolkit).
-np 1 with the same flags works fine (verified: ~127 t/s at full 262144 ctx). -np 8 without the drafter also works (verified: ~415 t/s aggregate). So the crash is specific to DFlash + multi-slot.
Reproducer
llama-server.exe \
-m Qwen3.6-27B-Q8_0.gguf \
-md dflash-draft-3.6-q8_0.gguf \
--spec-type dflash \
-ngl 99 -ngld 99 \
-np 8 -c 262144 --kv-unified \
-cd 256 -fa on -b 256 -ub 64 \
--jinja --chat-template-kwargs '{\"enable_thinking\": false}' \
--no-warmup
Target: `unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf`
Drafter: `spiritbuun/Qwen3.6-27B-DFlash-GGUF/dflash-draft-3.6-q8_0.gguf`
Then hit `/v1/chat/completions` with N = -np concurrent requests. Crash happens on the first decode ubatch after the last slot's prompt processing finishes. Even `-np 2` reproduces.
Crash on v0.1.2 (`-np 2`)
```
srv update_slots: decode ubatch: 5 tok, 66.3ms (13.25ms/tok)
C:\Users\anbee\projects\beellama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error
CUDA error: unspecified launch failure
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```
Crash on `main` ba1fdce (`-np 8`)
```
srv update_slots: decode ubatch: 29 tok, 269.9ms (9.31ms/tok)
ggml-cuda.cu:98: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```
Note the error type differs (`unspecified launch failure` vs `illegal memory access`) but both surface at the same site (`cudaStreamSynchronize` post-decode) and at the same point in the slot lifecycle (immediately after all slots finish prompt-processing and the first multi-slot decode ubatch runs).
What I verified rules out
- Not env / driver / build: same machine reproduces with v0.1.2 prebuilt and with locally compiled `ba1fdce5` (`CMAKE_CUDA_ARCHITECTURES=120`, CUDA 13.2, MSVC 19.44).
- Not VRAM pressure: card is 96 GiB; only ~50 GiB used at startup.
- Not the model file: Q8 target loads, runs perfectly at `-np 1` with DFlash (127 t/s @ 49.5% accept, 256k ctx).
- Not multi-slot per se: `-np 8` without `--spec-type dflash` works fine, ~27 t/s/slot, ~200 t/s aggregate.
- Not `--kv-unified` vs non-unified: crashes both ways.
- Not the `-ot token_embd.weight=CUDA0` flag: removed it, still crashes.
Hardware / build
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GiB, driver 596.36)
- OS: Windows 11
- CUDA toolkit: 13.2.78
- Built with: `cmake -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120`
Happy to provide longer logs or run with extra debug flags if helpful.
Summary
llama-serverwith--spec-type dflashcrashes immediately on the first multi-slot decode whenever-np > 1. Reproduces on v0.1.2 (prebuilt Win-CUDA-13.1) and on the currentmain(ba1fdce5, built from source) on NVIDIA RTX PRO 6000 Blackwell (sm_120, compute cap 12.0, driver 596.36, Windows 11, CUDA 13.2 toolkit).-np 1with the same flags works fine (verified: ~127 t/s at full 262144 ctx).-np 8without the drafter also works (verified: ~415 t/s aggregate). So the crash is specific to DFlash + multi-slot.Reproducer
Target: `unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf`
Drafter: `spiritbuun/Qwen3.6-27B-DFlash-GGUF/dflash-draft-3.6-q8_0.gguf`
Then hit `/v1/chat/completions` with N = -np concurrent requests. Crash happens on the first decode ubatch after the last slot's prompt processing finishes. Even `-np 2` reproduces.
Crash on v0.1.2 (`-np 2`)
```
srv update_slots: decode ubatch: 5 tok, 66.3ms (13.25ms/tok)
C:\Users\anbee\projects\beellama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error
CUDA error: unspecified launch failure
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```
Crash on `main` ba1fdce (`-np 8`)
```
srv update_slots: decode ubatch: 29 tok, 269.9ms (9.31ms/tok)
ggml-cuda.cu:98: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```
Note the error type differs (`unspecified launch failure` vs `illegal memory access`) but both surface at the same site (`cudaStreamSynchronize` post-decode) and at the same point in the slot lifecycle (immediately after all slots finish prompt-processing and the first multi-slot decode ubatch runs).
What I verified rules out
Hardware / build
Happy to provide longer logs or run with extra debug flags if helpful.