tri-sds — Triton Speculative Decoding Inference

What is Speculative Decoding?

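Speculative decoding (SD) speeds up LLM inference by using a small draft model to propose K tokens, then verifying them in a single target forward pass. If the target agrees with all of them, you get K tokens (plus one bonus token from the verify pass itself) for the cost of one target step and K cheap draft steps. If it disagrees, only the tokens up to the first mismatch are kept. Acceptance rates of 0.8-0.95 are common when draft and target share a model family.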
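As a rough illustration of the verify step, here is a minimal sketch of greedy acceptance: keep draft tokens until the first one that differs from the target's argmax, then emit the target's token instead. The function name `greedy_verify` and its argument layout are hypothetical and are not taken from the tri-sds source. The sketch covers greedy decoding only, not sampling-based acceptance.

```python
from typing import List, Sequence


def greedy_verify(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one speculative step under greedy decoding.

    draft_tokens:  the K tokens proposed by the draft model.
    target_argmax: K + 1 target argmax tokens from the single verify pass;
                   target_argmax[i] is the target's choice given the prefix
                   plus draft_tokens[:i]. (Assumed layout, for illustration.)
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("expected K + 1 target predictions for K draft tokens")

    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            # First disagreement: emit the target's token instead and stop.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(tok)

    # Every draft token matched, so the target's extra prediction is a bonus token.
    emitted.append(target_argmax[len(draft_tokens)])
    return emitted
```

Every step emits at least one token (the target's own choice), so the output is the same sequence that plain greedy decoding of the target would produce. Only the number of target forward passes changes.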

What this project is

A custom speculative decoding engine built in Triton, designed as a drop-in plugin for SGLang (v0.5.11). The goal is to provide an alternative SD backend with correct-by-construction verify logic and Triton kernels for GQA decode attention, targeting the SGLang Docker runtime on AMD MI300X.
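For readers unfamiliar with GQA decode attention, the following is a slow pure-Python reference of the math such a kernel computes for one new token: each group of query heads attends over the cached keys and values of a single shared KV head. This is a sketch for checking outputs, not the tri-sds Triton kernel. The name `gqa_decode_reference`, the tensor layouts, and the consecutive head-to-group mapping are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def gqa_decode_reference(
    q: List[Vector],              # [num_q_heads][head_dim], the single new token
    k_cache: List[List[Vector]],  # [num_kv_heads][seq_len][head_dim]
    v_cache: List[List[Vector]],  # [num_kv_heads][seq_len][head_dim]
) -> List[Vector]:
    """Grouped-query attention for one decode step (assumed layouts)."""
    num_q_heads, num_kv_heads = len(q), len(k_cache)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group = num_q_heads // num_kv_heads
    scale = 1.0 / math.sqrt(len(q[0]))

    out: List[Vector] = []
    for h in range(num_q_heads):
        kv = h // group  # assumption: consecutive query heads share one KV head
        scores = [scale * sum(a * b for a, b in zip(q[h], k)) for k in k_cache[kv]]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        head_dim = len(v_cache[kv][0])
        out.append([
            sum(w * v[d] for w, v in zip(weights, v_cache[kv])) / total
            for d in range(head_dim)
        ])
    return out
```

A reference like this is only practical for tiny shapes, but it gives a kernel something exact to be compared against.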

Correctness Benchmarks

I validated the full pipeline (prefill, decode, and spec verify) against vanilla PyTorch/HF, with a 100% argmax match across all modes. I've tested with the GPT family because that's what I can run on my personal GPU (4 GB), and tri-sds outperforms SGLang on both throughput and correctness on this task. That said, I know SGLang isn't really tuned for this regime, and the model is too small for SD to work properly. I am benchmarking bigger models.
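An argmax-match check of this kind can be sketched as below: for each position, compare the index of the largest logit from the engine under test with the index from the reference. The helper names are hypothetical and the logits are plain nested lists, so this is a sketch of the metric rather than the project's actual test harness.

```python
from typing import Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_match_rate(
    candidate_logits: Sequence[Sequence[float]],
    reference_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of positions where both logit rows pick the same token."""
    if len(candidate_logits) != len(reference_logits):
        raise ValueError("logit sequences must have the same length")
    if not reference_logits:
        raise ValueError("need at least one position to compare")
    matches = sum(
        argmax(c) == argmax(r)
        for c, r in zip(candidate_logits, reference_logits)
    )
    return matches / len(reference_logits)
```

A rate of 1.0 means the two implementations would emit identical tokens under greedy decoding, even if the raw logits differ slightly because of kernel-level numerics.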

Qwen3 Benchmarks on AMD MI300X

SGLang baseline (ROCm 7.2, torch 2.9.1, BF16, Qwen3-4B→Qwen3-32B, MI300X×1)

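| Benchmark | B=1 | B=2 | B=4 | B=8 |
| --- | --- | --- | --- | --- |
| SGLang non-spec (tok/s) | 41.5 | 40.1 | 39.3 | 24.4 |
| SGLang spec (tok/s) | 95.2 | 77.4 | 61.2 | 57.2 |

SGLang's native speculative decoding with an EAGLE draft head achieves a 1.56–2.34× speedup over its own non-spec baseline (2.29× at B=1, 1.93× at B=2, 1.56× at B=4, 2.34× at B=8). Spec throughput is highest at B=1 and drops as batch size increases. The likely explanation is that acceptance is highest at B=1 (lowest queue pressure): with more tokens in flight, draft predictions are likelier to diverge from the target, reducing the effective acceptance window. Note that the high ratio at B=8 comes from the sharp drop in the non-spec baseline (24.4 tok/s), not from faster spec decoding.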

tri-sds (same hardware, same model)

| Benchmark | B=1 | B=2 | B=4 | B=8 |
| --- | --- | --- | --- | --- |
| tri-sds non-spec (tok/s) | 30.6 | 25.5 | 24.1 | 20.1 |
| tri-sds spec (tok/s) | 76.1 | 43.9 | 37.5 | 35.3 |

The speedup ratios (1.56–2.49×) are in the same range as SGLang's, which is consistent with correct verification logic and comparable draft acceptance rates. They are closest at B=4 (1.56× for both) and B=1 (2.49× vs 2.29×), and fall behind at B=2 (1.72× vs 1.93×) and B=8 (1.76× vs 2.34×). Absolute throughput, however, is 18% to 43% lower depending on batch size and mode, or about 30% on average. This is expected: tri-sds is a research prototype with custom Triton kernels that lack the mature fusion, memory scheduling, and cache management that SGLang's production backend has spent years optimizing.

The tri-sds prefill kernel is a simple element-wise Triton matmul without flash-attention-level tiling; the decode kernel is a proof-of-concept GQA kernel that doesn't use warp specialization or async copy. The point of tri-sds is not to beat SGLang on throughput, but to provide a fully transparent Triton SD implementation where every line of the verify loop and every kernel is readable and modifiable.
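The ratios quoted in this section can be recomputed directly from the two tables. The short script below does that arithmetic; the throughput numbers are copied from the tables above, while the variable and function names are made up for this example.

```python
BATCH_SIZES = (1, 2, 4, 8)

# Throughput in tok/s, copied from the two benchmark tables.
SGLANG = {"non_spec": (41.5, 40.1, 39.3, 24.4), "spec": (95.2, 77.4, 61.2, 57.2)}
TRI_SDS = {"non_spec": (30.6, 25.5, 24.1, 20.1), "spec": (76.1, 43.9, 37.5, 35.3)}


def speedups(engine):
    """Spec over non-spec throughput for each batch size, rounded to 2 places."""
    return [round(s / n, 2) for s, n in zip(engine["spec"], engine["non_spec"])]


def gap_percent(mode):
    """How much lower tri-sds throughput is than SGLang's, in percent."""
    return [
        round(100 * (1 - t / s), 1)
        for t, s in zip(TRI_SDS[mode], SGLANG[mode])
    ]


if __name__ == "__main__":
    print("batch sizes:        ", list(BATCH_SIZES))
    print("SGLang speedup:     ", speedups(SGLANG))
    print("tri-sds speedup:    ", speedups(TRI_SDS))
    print("non-spec gap (%):   ", gap_percent("non_spec"))
    print("spec gap (%):       ", gap_percent("spec"))
```

Reporting the gap per batch size and per mode makes it easier to see where kernel work would pay off first.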

Conclusion

tri-sds demonstrates that correct EAGLE-style speculative decoding can be implemented entirely in Triton, with speedup ratios (and, by implication, acceptance rates) close to those of SGLang's production backend. The roughly 30% throughput gap is the cost of transparency, and it can be closed incrementally by fusing kernels, adding flash-attention prefill, and porting the decode kernel to use async-copy pipelines.
