DFlash speculative + multi-slot (`-np > 1`) crashes with CUDA error on first multi-slot decode (sm_120 Blackwell)

## Summary

`llama-server` with `--spec-type dflash` crashes immediately on the first multi-slot decode whenever `-np > 1`. Reproduces on v0.1.2 (prebuilt Win-CUDA-13.1) **and** on the current `main` (`ba1fdce5`, built from source) on NVIDIA RTX PRO 6000 Blackwell (sm_120, compute cap 12.0, driver 596.36, Windows 11, CUDA 13.2 toolkit).

`-np 1` with the same flags works fine (verified: ~127 t/s at full 262144 ctx). `-np 8` *without* the drafter also works (verified: ~415 t/s aggregate). So the crash is specific to DFlash + multi-slot.

## Reproducer

```
llama-server.exe \
  -m Qwen3.6-27B-Q8_0.gguf \
  -md dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash \
  -ngl 99 -ngld 99 \
  -np 8 -c 262144 --kv-unified \
  -cd 256 -fa on -b 256 -ub 64 \
  --jinja --chat-template-kwargs '{\"enable_thinking\": false}' \
  --no-warmup
```

Target: \`unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf\`
Drafter: \`spiritbuun/Qwen3.6-27B-DFlash-GGUF/dflash-draft-3.6-q8_0.gguf\`

Then hit \`/v1/chat/completions\` with **N = -np** concurrent requests. Crash happens on the first decode ubatch after the last slot's prompt processing finishes. Even \`-np 2\` reproduces.

## Crash on v0.1.2 (\`-np 2\`)

\`\`\`
srv  update_slots:   decode ubatch: 5 tok, 66.3ms (13.25ms/tok)
C:\Users\anbee\projects\beellama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
  cudaStreamSynchronize(cuda_ctx->stream())
\`\`\`

## Crash on \`main\` ba1fdce5 (\`-np 8\`)

\`\`\`
srv  update_slots:   decode ubatch: 29 tok, 269.9ms (9.31ms/tok)
ggml-cuda.cu:98: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:3129
  cudaStreamSynchronize(cuda_ctx->stream())
\`\`\`

Note the error type differs (\`unspecified launch failure\` vs \`illegal memory access\`) but both surface at the same site (\`cudaStreamSynchronize\` post-decode) and at the same point in the slot lifecycle (immediately after all slots finish prompt-processing and the first multi-slot decode ubatch runs).

## What I verified rules out

- **Not env / driver / build**: same machine reproduces with v0.1.2 prebuilt and with locally compiled \`ba1fdce5\` (\`CMAKE_CUDA_ARCHITECTURES=120\`, CUDA 13.2, MSVC 19.44).
- **Not VRAM pressure**: card is 96 GiB; only ~50 GiB used at startup.
- **Not the model file**: Q8 target loads, runs perfectly at \`-np 1\` with DFlash (127 t/s @ 49.5% accept, 256k ctx).
- **Not multi-slot per se**: \`-np 8\` without \`--spec-type dflash\` works fine, ~27 t/s/slot, ~200 t/s aggregate.
- **Not \`--kv-unified\` vs non-unified**: crashes both ways.
- **Not the \`-ot token_embd.weight=CUDA0\` flag**: removed it, still crashes.

## Hardware / build

- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GiB, driver 596.36)
- OS: Windows 11
- CUDA toolkit: 13.2.78
- Built with: \`cmake -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120\`

Happy to provide longer logs or run with extra debug flags if helpful.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DFlash speculative + multi-slot (`-np > 1`) crashes with CUDA error on first multi-slot decode (sm_120 Blackwell) #24

Summary

Reproducer

Crash on v0.1.2 (`-np 2`)

Crash on `main` `ba1fdce` (`-np 8`)

What I verified rules out

Hardware / build

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

DFlash speculative + multi-slot (-np > 1) crashes with CUDA error on first multi-slot decode (sm_120 Blackwell) #24

Description

Summary

Reproducer

Crash on v0.1.2 (`-np 2`)

Crash on `main` ba1fdce (`-np 8`)

What I verified rules out

Hardware / build

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

DFlash speculative + multi-slot (`-np > 1`) crashes with CUDA error on first multi-slot decode (sm_120 Blackwell) #24

Crash on `main` `ba1fdce` (`-np 8`)