Skip to content

DFlash speculative + multi-slot (-np > 1) crashes with CUDA error on first multi-slot decode (sm_120 Blackwell) #24

@alexop1000

Description

@alexop1000

Summary

llama-server with --spec-type dflash crashes immediately on the first multi-slot decode whenever -np > 1. Reproduces on v0.1.2 (prebuilt Win-CUDA-13.1) and on the current main (ba1fdce5, built from source) on NVIDIA RTX PRO 6000 Blackwell (sm_120, compute cap 12.0, driver 596.36, Windows 11, CUDA 13.2 toolkit).

-np 1 with the same flags works fine (verified: ~127 t/s at full 262144 ctx). -np 8 without the drafter also works (verified: ~415 t/s aggregate). So the crash is specific to DFlash + multi-slot.

Reproducer

llama-server.exe \
  -m Qwen3.6-27B-Q8_0.gguf \
  -md dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash \
  -ngl 99 -ngld 99 \
  -np 8 -c 262144 --kv-unified \
  -cd 256 -fa on -b 256 -ub 64 \
  --jinja --chat-template-kwargs '{\"enable_thinking\": false}' \
  --no-warmup

Target: `unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf`
Drafter: `spiritbuun/Qwen3.6-27B-DFlash-GGUF/dflash-draft-3.6-q8_0.gguf`

Then hit `/v1/chat/completions` with N = -np concurrent requests. Crash happens on the first decode ubatch after the last slot's prompt processing finishes. Even `-np 2` reproduces.

Crash on v0.1.2 (`-np 2`)

```
srv update_slots: decode ubatch: 5 tok, 66.3ms (13.25ms/tok)
C:\Users\anbee\projects\beellama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error
CUDA error: unspecified launch failure
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```

Crash on `main` ba1fdce (`-np 8`)

```
srv update_slots: decode ubatch: 29 tok, 269.9ms (9.31ms/tok)
ggml-cuda.cu:98: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
cudaStreamSynchronize(cuda_ctx->stream())
```

Note the error type differs (`unspecified launch failure` vs `illegal memory access`) but both surface at the same site (`cudaStreamSynchronize` post-decode) and at the same point in the slot lifecycle (immediately after all slots finish prompt-processing and the first multi-slot decode ubatch runs).

What I verified rules out

  • Not env / driver / build: same machine reproduces with v0.1.2 prebuilt and with locally compiled `ba1fdce5` (`CMAKE_CUDA_ARCHITECTURES=120`, CUDA 13.2, MSVC 19.44).
  • Not VRAM pressure: card is 96 GiB; only ~50 GiB used at startup.
  • Not the model file: Q8 target loads, runs perfectly at `-np 1` with DFlash (127 t/s @ 49.5% accept, 256k ctx).
  • Not multi-slot per se: `-np 8` without `--spec-type dflash` works fine, ~27 t/s/slot, ~200 t/s aggregate.
  • Not `--kv-unified` vs non-unified: crashes both ways.
  • Not the `-ot token_embd.weight=CUDA0` flag: removed it, still crashes.

Hardware / build

  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GiB, driver 596.36)
  • OS: Windows 11
  • CUDA toolkit: 13.2.78
  • Built with: `cmake -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120`

Happy to provide longer logs or run with extra debug flags if helpful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions