Speculative decoding (SD) speeds up LLM inference by having a small draft model propose K tokens, which the target model then verifies in a single forward pass. If the target agrees with all of them, you get K tokens for the cost of one target step (plus K cheap draft steps). Acceptance rates of 0.8–0.95 are common when the draft and target models share a family.
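The propose-then-verify step can be sketched in a few lines of Python. This is a minimal illustration that assumes greedy (argmax) decoding; the function name `greedy_verify` and the convention that the target returns one extra prediction after the last draft token are assumptions made for the example, not the tri-sds API.

```python
from typing import List


def greedy_verify(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Return the tokens emitted by one speculative step under greedy decoding.

    draft_tokens:  the K tokens proposed by the draft model.
    target_argmax: K + 1 greedy predictions from one target forward pass.
                   target_argmax[i] is the target's choice for the position
                   that draft_tokens[i] fills; the last entry is the target's
                   prediction after all K draft tokens (hypothetical layout).
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("expected one more target prediction than draft tokens")

    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            # First disagreement: keep the target's token and discard the rest.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(tok)

    # Every draft token matched, so the extra target prediction is also valid.
    emitted.append(target_argmax[-1])
    return emitted
```

With this convention, the output always equals what plain greedy decoding with the target would have produced, which is why a correct verify step cannot change the generated text.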
tri-sds is a custom speculative decoding engine built in Triton, designed as a drop-in plugin for SGLang (v0.5.11). The goal is to provide an alternative SD backend with correct-by-construction verify logic and Triton kernels for GQA decode attention, targeting the SGLang Docker runtime on AMD MI300X.
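For readers unfamiliar with GQA decode attention, here is a plain-Python reference of the computation such a kernel performs for a single new token. It is not the tri-sds Triton kernel: the function name, the nested-list tensor layout, and the mapping of consecutive query heads to one shared KV head are assumptions chosen for readability.

```python
import math
from typing import List

Matrix = List[List[float]]


def gqa_decode_attention(q: Matrix, k_cache: List[Matrix], v_cache: List[Matrix]) -> Matrix:
    """Single-token grouped-query attention.

    q:       [num_q_heads][head_dim]            query for the new token
    k_cache: [num_kv_heads][seq_len][head_dim]  cached keys
    v_cache: [num_kv_heads][seq_len][head_dim]  cached values
    Returns: [num_q_heads][head_dim]
    """
    num_q_heads, num_kv_heads = len(q), len(k_cache)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)

    out: Matrix = []
    for h, q_h in enumerate(q):
        kv = h // group_size  # consecutive query heads share one KV head
        scores = [scale * sum(a * b for a, b in zip(q_h, k_t)) for k_t in k_cache[kv]]
        # Numerically stable softmax over the cached positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([
            sum(w * v_t[d] for w, v_t in zip(weights, v_cache[kv])) / denom
            for d in range(head_dim)
        ])
    return out
```

A reference like this is mainly useful as a correctness oracle: a fused kernel can be checked against it on small random inputs before any performance work starts.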
I validated the full pipeline (prefill, decode, and spec verify) against vanilla PyTorch/HF, with a 100% argmax match across all modes. So far I've tested with the GPT family because that's what I can run on my personal GPU (4 GB), and on that setup I'm outperforming SGLang on both throughput and correctness. That said, I know SGLang isn't really tuned for this kind of setup, and the model is too small for SD to work properly. I'm currently benchmarking bigger models.
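An argmax-match check of this kind can be sketched as follows. The helper names and the assumption that both sides expose per-position logits as nested lists are hypothetical; the actual validation harness may be structured differently.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_match_rate(ref_logits: List[Sequence[float]],
                      test_logits: List[Sequence[float]]) -> float:
    """Fraction of positions where both implementations pick the same token.

    ref_logits:  per-position logits from the reference (e.g. PyTorch/HF).
    test_logits: per-position logits from the engine under test.
    """
    if len(ref_logits) != len(test_logits):
        raise ValueError("both runs must cover the same number of positions")
    if not ref_logits:
        raise ValueError("nothing to compare")
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return matches / len(ref_logits)
```

Comparing argmax tokens rather than raw logits tolerates small floating-point differences between kernels while still catching any change in the greedy output.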

| Benchmark | B=1 | B=2 | B=4 | B=8 |
|---|---|---|---|---|
| SGLang non-spec (tok/s) | 41.5 | 40.1 | 39.3 | 24.4 |
| SGLang spec (tok/s) | 95.2 | 77.4 | 61.2 | 57.2 |

SGLang's native speculative decoding with an EAGLE draft head achieves a 1.56–2.34× speedup over its own non-spec baseline. Acceptance rates are highest at B=1 (lowest queue pressure) and drop as batch size increases: with more tokens in flight, the draft predictions are likelier to diverge from the target, which reduces the effective acceptance window.

| Benchmark | B=1 | B=2 | B=4 | B=8 |
|---|---|---|---|---|
| tri-sds non-spec (tok/s) | 30.6 | 25.5 | 24.1 | 20.1 |
| tri-sds spec (tok/s) | 76.1 | 43.9 | 37.5 | 35.3 |

The speedup ratios (1.56–2.49×) fall in the same range as SGLang's (1.56–2.34×), which is consistent with correct verification logic and comparable draft acceptance rates. However, absolute throughput is lower at every batch size: by 18–43% depending on mode and batch size, or roughly 30% on average. This is expected. tri-sds is a research prototype whose custom Triton kernels lack the mature fusion, memory scheduling, and cache management that SGLang's runtime has spent years optimizing. The tri-sds prefill kernel is a simple element-wise Triton matmul without flash-attention-level tiling; the decode kernel is a proof-of-concept GQA kernel that doesn't use warp specialization or async copy. The point of tri-sds is not to beat SGLang on throughput, but to provide a fully transparent Triton SD implementation where every line of the verify loop and every kernel is readable and modifiable.
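The ratios discussed above can be recomputed directly from the two tables. The snippet below is a small sketch that hard-codes the reported tok/s figures; the variable and function names are illustrative only.

```python
# Throughput in tok/s, copied from the tables above (B = 1, 2, 4, 8).
BATCH_SIZES = [1, 2, 4, 8]
SGLANG = {"non_spec": [41.5, 40.1, 39.3, 24.4], "spec": [95.2, 77.4, 61.2, 57.2]}
TRI_SDS = {"non_spec": [30.6, 25.5, 24.1, 20.1], "spec": [76.1, 43.9, 37.5, 35.3]}


def speedups(results):
    """Spec throughput divided by non-spec throughput, per batch size."""
    return [s / n for s, n in zip(results["spec"], results["non_spec"])]


def gap_percent(ours, theirs):
    """How much lower `ours` is than `theirs`, in percent, per batch size."""
    return [100.0 * (1.0 - o / t) for o, t in zip(ours, theirs)]


if __name__ == "__main__":
    for name, results in (("SGLang", SGLANG), ("tri-sds", TRI_SDS)):
        row = ", ".join(f"B={b}: {x:.2f}x" for b, x in zip(BATCH_SIZES, speedups(results)))
        print(f"{name} speedup: {row}")
    for mode in ("non_spec", "spec"):
        gaps = gap_percent(TRI_SDS[mode], SGLANG[mode])
        row = ", ".join(f"B={b}: {g:.0f}%" for b, g in zip(BATCH_SIZES, gaps))
        print(f"tri-sds vs SGLang ({mode}) gap: {row}")
```

Looking at the per-batch numbers rather than a single average makes it easier to see where the two engines diverge most.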
tri-sds demonstrates that correct EAGLE-style speculative decoding can be implemented entirely in Triton, with acceptance rates comparable to SGLang's production backend. The roughly 30% throughput gap is the cost of transparency, and it can be closed incrementally by fusing kernels, adding flash-attention prefill, and porting the decode kernel to async-copy pipelines.