Feature Request: Integrate MTP Speculative Decoding for 2x+ Speedup
Hi! SwiftLM's SSD streaming and TurboQuant tech are already the most impressive in the MLX ecosystem. But I think there's a chance to push it even further.
The Idea
MTPLX is a native MTP (Multi-Token Prediction) speculative decoding engine for Apple Silicon that achieves 2.24× faster TPS:
- Qwen3.6-27B on M5 Max: 28 tok/s → 63 tok/s
- Uses the model's built-in MTP heads — no external drafter, no extra memory
- Non-greedy temperature sampling via rejection sampling — unlike most speculative decoding implementations on Apple Silicon, which only support greedy decoding
- Built on a patched MLX fork with custom Metal kernels
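For context, the non-greedy path above rests on the standard speculative-sampling accept/reject rule: accept each draft token with probability min(1, p/q), and on rejection resample from the residual distribution so the output matches the target model exactly. A minimal NumPy sketch of that rule — function and argument names are mine for illustration, not MTPLX's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_tokens, q_probs, p_probs):
    """Verify k drafted tokens against the target model's distributions.

    draft_tokens: k tokens proposed by the draft path (e.g. MTP heads).
    q_probs[i]:   draft distribution over the vocab at step i (k rows).
    p_probs[i]:   target-model distribution at step i (k+1 rows; the
                  extra row samples a bonus token if all drafts pass).
    Returns the accepted token sequence (1 to k+1 tokens).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the draft token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(int(tok))
        else:
            # On rejection, resample from the residual max(0, p - q),
            # normalized; this keeps the output distribution exactly p.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            return accepted  # stop at the first rejection
    # All k drafts accepted: sample one bonus token from the target model.
    accepted.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return accepted
```

The key property is that acceptance preserves the target distribution, which is why temperature sampling works and the speedup is "free" in output quality.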
Why SwiftLM + MTP Speculative Decoding Would Be Unbeatable
SwiftLM already has:
- ✅ SSD expert streaming (126 GB model on 64 GB RAM)
- ✅ TurboQuant KV cache compression
- ✅ Native Swift/Metal single binary, no GIL overhead
- ✅ Built-in speculative decoding framework (draft model support)
MTP speculative decoding would complement all of these:
- For in-RAM models: MTP heads replace the need for a separate draft model — 2x+ speedup with zero extra memory
- For SSD-streamed models: MTP reduces the number of round-trips to SSD per token, potentially amplifying the TurboQuant + SSD streaming advantage at long context
- The Metal kernel work is already done — MTPLX has custom Metal verify kernels, compiled verify graphs, and innovation-tape GDN rollback that could be adapted to Swift's MLX bindings
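The SSD round-trip reduction can be put in rough numbers. Under the usual geometric model from the speculative-decoding literature, with k draft tokens and a per-token acceptance rate α, each target forward pass — and hence each SSD round-trip for streamed experts — yields on average 1 + α + α² + … + α^k tokens. A back-of-envelope sketch (α here is an assumed figure, not a measured MTPLX number):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when k
    draft tokens are proposed and each is accepted independently with
    rate alpha (geometric model; one token is always emitted)."""
    # 1 + alpha + alpha^2 + ... + alpha^k
    return sum(alpha**i for i in range(k + 1))

# e.g. with 3 MTP draft tokens and 80% acceptance, each forward pass
# (one SSD round-trip for streamed experts) yields ~2.95 tokens.
```

The payoff grows with k only as long as α stays high, so the sweet spot depends on how well the MTP heads predict the target model — something worth benchmarking per model family.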
Technical Opportunity
SwiftLM's existing speculative decoding support already has the verification pipeline. MTP heads would eliminate the need for a separate draft model entirely — the model generates its own drafts from the built-in MTP heads (Qwen3.x, Gemma 4, etc. already ship them).
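To make the contrast concrete, here is a toy sketch of the two drafting paths. `draft_model`, `model.backbone`, and `model.mtp_heads` are hypothetical names for illustration — neither SwiftLM's nor MTPLX's real API:

```python
import numpy as np

def draft_with_separate_model(draft_model, context, k):
    """Classic speculative decoding: a second, smaller model proposes
    k tokens autoregressively -- k extra forward passes, plus the
    memory cost of keeping a whole second model resident."""
    tokens = []
    for _ in range(k):
        logits = draft_model(context + tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens

def draft_with_mtp_heads(model, context, k):
    """MTP drafting: the target model's own prediction heads emit k
    candidate tokens from a single backbone pass -- no second model,
    no extra weights beyond the heads the checkpoint already ships."""
    hidden = model.backbone(context)  # one pass over the shared backbone
    return [int(np.argmax(head(hidden))) for head in model.mtp_heads[:k]]
```

The second path is what makes this attractive for SwiftLM: the existing verification pipeline stays as-is, and only the draft-token source changes.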
Why Now?
MTP is becoming a standard feature in modern LLM architectures. Qwen3.5/3.6, Gemma 4, and others all include MTP heads. Projects that can leverage these heads for speculative decoding will have a massive speed advantage, especially on Apple Silicon where memory bandwidth is the bottleneck.
Would love to see this explored! SwiftLM's architecture seems like the best fit for this kind of deep Metal-level optimization in the MLX ecosystem. 🚀