feat: 10x SSD expert streaming speedup + speculative decoding for MoE on Apple Silicon#26
Conversation
Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds speculative decoding infrastructure, achieving a 10x generation speedup for large MoE models on memory-constrained Apple Silicon.

Server changes:
- `--draft-model` CLI flag to load a second model for speculative decoding
- `--num-draft-tokens` to configure tokens per speculation round
- Dual-model loading: draft model in RAM + main model in SSD streaming
- Automatic routing to speculative vs. standard generation

Dependency: requires the SharpAI/mlx-swift-lm `feat/ssd-streaming-10x` branch with the SSD streaming optimizations + MambaCache rollback.

Tested: Qwen3.5-122B-A10B-4bit, 0.58 -> 4.95 tok/s (8.5x) on an M1 Ultra 64GB.

Fixes SharpAI#24
Hi @ericjlake, thanks for your PR, this is amazing...

I'm checking on it now
On Sat, Apr 11, 2026 at 8:27 PM, *solderzzc* left a comment (SharpAI/SwiftLM#26):

> Hi @ericjlake, thanks for your PR, this is amazing...
> Would you like to check if there's code to be merged to https://github.com/SharpAI/mlx-swift-lm.git ?
>
> ```
> error: could not find a branch named 'feat/ssd-streaming-10x' in
> https://github.com/SharpAI/mlx-swift-lm.git
> ```
CI failed because SharpAI/mlx-swift-lm does not have `feat/ssd-streaming-10x`. Pointing the dependency at the ericjlake/mlx-swift-lm fork instead, which has the branch with all of the SSD streaming library changes (SwitchLayers, Qwen35, KVCache, Evaluate).
CI fix pushed (dc40fb7). Run #65 is waiting for workflow approval; once approved, it should resolve the dependencies cleanly. The library changes (SwitchLayers, Qwen35, KVCache, Evaluate) are all on the fork branch. Happy to move the dependency to SharpAI/mlx-swift-lm once those changes are pushed there.
@ericjlake Great, the CI passed. Do you want to PR ericjlake/mlx-swift-lm to the https://github.com/SharpAI/mlx-swift-lm.git main branch?
I have created PR SharpAI/mlx-swift-lm#9
Fixes #24
Summary
Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds speculative decoding infrastructure, achieving a 10x generation speedup for large MoE models on memory-constrained Apple Silicon. Tested with Qwen3.5-122B-A10B-4bit (69.6 GB) and Qwen3.5-397B-A17B-4bit (209 GB), streaming expert weights from NVMe SSD on a 64GB M1 Ultra.
Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)
`--stream-experts`: 0.58 -> 4.95 tok/s (8.5x)

What changed

This PR (SwiftLM Server.swift)
- `--draft-model <path>` CLI flag to load a second model for speculative decoding
- `--num-draft-tokens <n>` to configure tokens per speculation round (default: 4)
- Dependency pointed at the SharpAI/mlx-swift-lm branch `feat/ssd-streaming-10x`

Companion dependency changes (SharpAI/mlx-swift-lm `feat/ssd-streaming-10x`)

The core optimizations live in 5 files across mlx-swift-lm. The branch with these changes needs to be created on SharpAI/mlx-swift-lm; the full diff is available in the NAS repo at /Volumes/Tend Well Life/code/repos/mlx-swift-lm (4 commits on top of main b71fad2).

SSD Streaming Optimizations:
- `SwitchLayers.swift` (+240 lines): Cross-projection batching, concurrent pread via `DispatchQueue.concurrentPerform` (NVMe queue depth 1→24), persistent Metal buffers, asyncEval pipeline with speculative pread (~70% hit rate)
- `Qwen35.swift` (+9 lines): Runtime `SWIFTLM_TOP_K` env var to reduce active experts per token

Speculative Decoding Infrastructure:
- `ModelContainer.swift` (+52 lines): `DraftModelRef` (`@unchecked Sendable` wrapper), `extractDraftModel()`, and a speculative-generate convenience method
- `KVCache.swift` (+30 lines): `MambaCache` checkpoint/restore for hybrid Attention+Mamba architectures; enables speculative decoding on Qwen3.5
- `Evaluate.swift` (+5 lines): Mamba state checkpointing in `SpeculativeTokenIterator.speculateRound()`

Key findings
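For illustration, the concurrent pread pattern used for SSD expert streaming (issuing many reads in parallel so the NVMe queue stays deep instead of serving one request at a time) can be sketched roughly as below. This is a minimal sketch, not the PR's actual SwitchLayers.swift code; the `ChunkRequest` type and function name are hypothetical.

```swift
import Foundation

// Hypothetical descriptor for one expert-weight chunk on disk.
struct ChunkRequest {
    let offset: off_t
    let length: Int
}

// Read all chunks from an already-open file descriptor in parallel.
// concurrentPerform blocks until every iteration finishes, and each
// pread(2) carries its own offset, so no shared file position is needed.
func readChunksConcurrently(fd: Int32, requests: [ChunkRequest]) -> [Data?] {
    var results = [Data?](repeating: nil, count: requests.count)
    let lock = NSLock()  // guards writes into the shared results array
    DispatchQueue.concurrentPerform(iterations: requests.count) { i in
        let req = requests[i]
        var buffer = Data(count: req.length)
        let n = buffer.withUnsafeMutableBytes { raw -> Int in
            pread(fd, raw.baseAddress, req.length, req.offset)
        }
        lock.lock()
        results[i] = (n == req.length) ? buffer : nil  // nil on short read
        lock.unlock()
    }
    return results
}
```

With sequential reads the device sees queue depth 1; dispatching the preads through `concurrentPerform` is one straightforward way to approach the deeper queue (1→24) the PR describes.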
Usage
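A representative invocation might look like the following. The `swiftlm-server` binary name and model paths are placeholders; `--draft-model`, `--num-draft-tokens`, and `--stream-experts` are the options described above.

```shell
# Main MoE model streams expert weights from SSD; a small draft model
# is held fully in RAM to drive speculative decoding.
swiftlm-server \
  --model /path/to/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --draft-model /path/to/draft-model-4bit \
  --num-draft-tokens 4
```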
Test plan
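One possible way to exercise the change end to end (the port and OpenAI-style endpoint here are assumptions, not confirmed by this PR):

```shell
# 1. Start the server twice with the same prompt and settings:
#    once without --draft-model (standard path) and once with it
#    (speculative path), and compare the reported tok/s.
# 2. Issue an identical completion request against each run:
curl -s http://localhost:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Explain MoE routing in one paragraph.", "max_tokens": 64}'
# 3. Verify the two runs produce coherent text and that the
#    speculative run is faster, in line with the Results section.
```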