
Feature Request: Integrate MTP Speculative Decoding (MTPLX-style) for 2x+ Speedup #102

@Suidge


Hi! SwiftLM's SSD streaming and TurboQuant are already the most impressive tech in the MLX ecosystem, but I think there's a chance to push it even further.

The Idea

MTPLX is a native MTP (Multi-Token Prediction) speculative decoding engine for Apple Silicon that achieves 2.24× faster TPS:

  • Qwen3.6-27B on M5 Max: 28 tok/s → 63 tok/s
  • Uses the model's built-in MTP heads — no external drafter, no extra memory
  • Non-greedy temperature sampling with rejection sampling (unlike greedy-only speculative decoding on Apple Silicon)
  • Built on a patched MLX fork with custom Metal kernels
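The non-greedy point is the interesting one: standard speculative sampling accepts a drafted token with probability min(1, p(x)/q(x)) against the target distribution p and draft distribution q, and resamples from the renormalized residual max(0, p − q) on rejection, which keeps the output distribution exactly equal to the target's. A minimal sketch of that acceptance rule (the function name and `rng` hook are mine, for illustration — not MTPLX's API):

```python
import random

def spec_accept(p, q, x, rng=random.random):
    """Speculative-sampling acceptance test for a drafted token.

    p: target-model distribution over the vocabulary (list of probs).
    q: draft distribution the token x was sampled from.
    Accept x with prob min(1, p[x]/q[x]); on rejection, resample from
    the residual max(0, p - q), renormalized. The returned token is
    distributed exactly according to p, so non-greedy sampling stays exact.
    """
    if q[x] > 0 and rng() < min(1.0, p[x] / q[x]):
        return x, True                       # draft token accepted
    # Rejected: sample a replacement from the residual distribution.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:                           # p == q everywhere; fall back to p
        residual, total = list(p), sum(p)
    r, acc = random.random() * total, 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i, False
    return len(p) - 1, False                 # guard against float rounding
```

This is the same acceptance rule greedy-only implementations skip; supporting it is what makes temperature sampling work under speculation.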

Why SwiftLM + MTP Speculative Decoding Would Be Unbeatable

SwiftLM already has:

  • ✅ SSD expert streaming (126 GB model on 64 GB RAM)
  • ✅ TurboQuant KV cache compression
  • ✅ Native Swift/Metal single binary, no GIL overhead
  • ✅ Built-in speculative decoding framework (draft model support)

MTP speculative decoding would complement all of these:

  • For in-RAM models: MTP heads replace the need for a separate draft model — 2x+ speedup with zero extra memory
  • For SSD-streamed models: MTP reduces the number of round-trips to SSD per token, potentially amplifying the TurboQuant + SSD streaming advantage at long context
  • The Metal kernel work is already done — MTPLX has custom Metal verify kernels, compiled verify graphs, and innovation-tape GDN rollback that could be adapted to Swift's MLX bindings

Technical Opportunity

SwiftLM's existing speculative decoding support already has the verification pipeline. MTP heads would eliminate the need for a separate draft model entirely — the model generates its own drafts from the built-in MTP heads (Qwen3.x, Gemma 4, etc. already ship them).
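To make the shape of the change concrete, here is a toy sketch of the propose-verify loop when the drafter is the model's own MTP heads rather than a second model. Everything here is hypothetical scaffolding (`target_next`, `mtp_draft`, `speculative_step` are illustrative names, not SwiftLM's API), and it uses simple greedy verification for clarity; the exact non-greedy acceptance rule would slot into the comparison step:

```python
def speculative_step(context, target_next, mtp_draft, k=4):
    """One speculative-decoding step where the model drafts for itself.

    target_next(ctx) -> greedy next token from the full model.
    mtp_draft(ctx, k) -> k cheap draft tokens from the model's MTP heads.
    Accept the longest draft prefix the target agrees with, then append
    the target's own token at the first mismatch (or after a full accept),
    so each step emits between 1 and k+1 tokens. In a real engine the
    verification is a single batched forward pass, not a Python loop.
    """
    drafts = mtp_draft(context, k)
    out = []
    for d in drafts:
        t = target_next(context + out)       # verify this draft position
        if t != d:                           # first disagreement: stop here
            out.append(t)
            return out
        out.append(d)
    out.append(target_next(context + out))   # bonus token when all accepted
    return out
```

The point of the sketch: the existing verification pipeline stays as-is, and only the draft source changes — `mtp_draft` reads the MTP heads off the same forward pass instead of invoking a separate draft model.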

Why Now?

MTP is becoming a standard feature in modern LLM architectures. Qwen3.5/3.6, Gemma 4, and others all include MTP heads. Projects that can leverage these heads for speculative decoding will have a massive speed advantage, especially on Apple Silicon where memory bandwidth is the bottleneck.


Would love to see this explored! SwiftLM's architecture seems like the best fit for this kind of deep Metal-level optimization in the MLX ecosystem. 🚀
