
Feature Request: Integrate MTP Speculative Decoding (MTPLX-style) for 2x+ Speedup #102

@Suidge


Hi! SwiftLM's SSD streaming and TurboQuant are already the most impressive tech in the MLX ecosystem, but I think there's a chance to push it even further.

The Idea

MTPLX is a native MTP (Multi-Token Prediction) speculative decoding engine for Apple Silicon that achieves 2.24× faster TPS:

  • Qwen3.6-27B on M5 Max: 28 tok/s → 63 tok/s
  • Uses the model's built-in MTP heads — no external drafter, no extra memory
  • Non-greedy temperature sampling with rejection sampling (unlike greedy-only speculative decoding on Apple Silicon)
  • Built on a patched MLX fork with custom Metal kernels
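The non-greedy point is the interesting one: standard speculative sampling accepts a drafted token with probability min(1, p(x)/q(x)) against the target distribution p and draft distribution q, and resamples from the renormalized residual max(0, p − q) on rejection, which keeps the output distribution exactly equal to the target's. A minimal sketch of that acceptance rule (the function name and `rng` hook are mine, for illustration — not MTPLX's API):

```python
import random

def spec_accept(p, q, x, rng=random.random):
    """Speculative-sampling acceptance test for a drafted token.

    p: target-model distribution over the vocabulary (list of probs).
    q: draft distribution the token x was sampled from.
    Accept x with prob min(1, p[x]/q[x]); on rejection, resample from
    the residual max(0, p - q), renormalized. The returned token is
    distributed exactly according to p, so non-greedy sampling stays exact.
    """
    if q[x] > 0 and rng() < min(1.0, p[x] / q[x]):
        return x, True                       # draft token accepted
    # Rejected: sample a replacement from the residual distribution.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:                           # p == q everywhere; fall back to p
        residual, total = list(p), sum(p)
    r, acc = random.random() * total, 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i, False
    return len(p) - 1, False                 # guard against float rounding
```

This is the same acceptance rule greedy-only implementations skip; supporting it is what makes temperature sampling work under speculation.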

Why SwiftLM + MTP Speculative Decoding Would Be Unbeatable

SwiftLM already has:

  • ✅ SSD expert streaming (126 GB model on 64 GB RAM)
  • ✅ TurboQuant KV cache compression
  • ✅ Native Swift/Metal single binary, no GIL overhead
  • ✅ Built-in speculative decoding framework (draft model support)

MTP speculative decoding would complement all of these:

  • For in-RAM models: MTP heads replace the need for a separate draft model — 2x+ speedup with zero extra memory
  • For SSD-streamed models: MTP reduces the number of round-trips to SSD per token, potentially amplifying the TurboQuant + SSD streaming advantage at long context
  • The Metal kernel work is already done — MTPLX has custom Metal verify kernels, compiled verify graphs, and innovation-tape GDN rollback that could be adapted to Swift's MLX bindings

Technical Opportunity

SwiftLM's existing speculative decoding support already has the verification pipeline. MTP heads would eliminate the need for a separate draft model entirely — the model generates its own drafts from the built-in MTP heads (Qwen3.x, Gemma 4, etc. already ship them).
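To make the shape of the change concrete, here is a toy sketch of the propose-verify loop when the drafter is the model's own MTP heads rather than a second model. Everything here is hypothetical scaffolding (`target_next`, `mtp_draft`, `speculative_step` are illustrative names, not SwiftLM's API), and it uses simple greedy verification for clarity; the exact non-greedy acceptance rule would slot into the comparison step:

```python
def speculative_step(context, target_next, mtp_draft, k=4):
    """One speculative-decoding step where the model drafts for itself.

    target_next(ctx) -> greedy next token from the full model.
    mtp_draft(ctx, k) -> k cheap draft tokens from the model's MTP heads.
    Accept the longest draft prefix the target agrees with, then append
    the target's own token at the first mismatch (or after a full accept),
    so each step emits between 1 and k+1 tokens. In a real engine the
    verification is a single batched forward pass, not a Python loop.
    """
    drafts = mtp_draft(context, k)
    out = []
    for d in drafts:
        t = target_next(context + out)       # verify this draft position
        if t != d:                           # first disagreement: stop here
            out.append(t)
            return out
        out.append(d)
    out.append(target_next(context + out))   # bonus token when all accepted
    return out
```

The point of the sketch: the existing verification pipeline stays as-is, and only the draft source changes — `mtp_draft` reads the MTP heads off the same forward pass instead of invoking a separate draft model.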

Why Now?

MTP is becoming a standard feature in modern LLM architectures. Qwen3.5/3.6, Gemma 4, and others all include MTP heads. Projects that can leverage these heads for speculative decoding will have a massive speed advantage, especially on Apple Silicon where memory bandwidth is the bottleneck.


Would love to see this explored! SwiftLM's architecture seems like the best fit for this kind of deep Metal-level optimization in the MLX ecosystem. 🚀
