
feat: 10x SSD expert streaming speedup + speculative decoding for MoE on Apple Silicon#26

Merged
solderzzc merged 2 commits into SharpAI:main from ericjlake:feat/ssd-streaming-10x on Apr 12, 2026
Conversation

@ericjlake
Contributor

Fixes #24

Summary

Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds speculative decoding infrastructure, achieving 10x generation speedup for large MoE models on memory-constrained Apple Silicon. Tested with Qwen3.5-122B-A10B-4bit (69.6 GB) and Qwen3.5-397B-A17B-4bit (209 GB) streaming expert weights from NVMe SSD on a 64GB M1 Ultra.

Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

| Configuration | tok/s | vs original |
| --- | --- | --- |
| Original `--stream-experts` | 0.58 | baseline |
| This PR (top-k=8, full quality) | 4.95 | 8.5x |
| This PR (top-k=6, default) | 5.20 | 9.0x |
| This PR (top-k=4, speed mode) | 5.91 | 10.2x |
| This PR (top-k=2, turbo mode) | 6.52 | 11.2x |

What changed

This PR (SwiftLM Server.swift)

  • --draft-model <path> CLI flag to load a second model for speculative decoding
  • --num-draft-tokens <n> to configure tokens per speculation round (default: 4)
  • Dual-model loading: draft model (e.g., 9B) fully in RAM + main model (e.g., 122B) in SSD streaming mode
  • Automatic routing to speculative vs. standard generation based on whether a draft model is loaded
  • Package.swift updated to pull SharpAI/mlx-swift-lm branch feat/ssd-streaming-10x
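The automatic-routing bullet above can be sketched as a single pure decision function (a minimal sketch: `GenerationConfig`, `GenerationMode`, and `selectMode` are hypothetical names, not the actual Server.swift API):

```swift
import Foundation

// Hypothetical sketch of routing between standard and speculative
// generation, keyed on whether --draft-model was supplied.
struct GenerationConfig {
    var draftModelPath: String?   // set by --draft-model <path>
    var numDraftTokens: Int = 4   // set by --num-draft-tokens <n>
}

enum GenerationMode: Equatable {
    case standard
    case speculative(draftTokens: Int)
}

func selectMode(_ config: GenerationConfig) -> GenerationMode {
    // Speculative decoding only when a draft model is actually loaded.
    guard config.draftModelPath != nil else { return .standard }
    return .speculative(draftTokens: config.numDraftTokens)
}
```

Keeping the decision in one pure function makes the with/without-draft-model behavior trivially testable.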

Companion dependency changes (SharpAI/mlx-swift-lm feat/ssd-streaming-10x)

The core optimizations live in 5 files across mlx-swift-lm. The branch with these changes needs to be created on SharpAI/mlx-swift-lm — full diff available in the NAS repo at /Volumes/Tend Well Life/code/repos/mlx-swift-lm (4 commits on top of main b71fad2).

SSD Streaming Optimizations:

  • SwitchLayers.swift (+240 lines): Cross-projection batching, concurrent pread via DispatchQueue.concurrentPerform (NVMe queue depth 1→24), persistent Metal buffers, asyncEval pipeline with speculative pread (~70% hit rate)
  • Qwen35.swift (+9 lines): Runtime SWIFTLM_TOP_K env var to reduce active experts per token
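The concurrent-pread idea can be illustrated with a minimal sketch (assumptions: a plain file descriptor and fixed-size slices standing in for per-expert weights; the real SwitchLayers.swift additionally batches cross-projections, reuses Metal buffers, and issues speculative reads):

```swift
import Foundation
import Dispatch
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Fan independent positioned reads out over a thread pool so several
// pread(2) calls are in flight at once, raising the NVMe queue depth
// above 1. A lock serializes writes into the shared results array.
func concurrentPread(fd: Int32, offsets: [Int], size: Int) -> [[UInt8]] {
    var results = [[UInt8]](repeating: [], count: offsets.count)
    let lock = NSLock()
    DispatchQueue.concurrentPerform(iterations: offsets.count) { i in
        var buf = [UInt8](repeating: 0, count: size)
        let n = buf.withUnsafeMutableBytes { raw in
            pread(fd, raw.baseAddress, size, off_t(offsets[i]))
        }
        let chunk = Array(buf.prefix(max(n, 0)))
        lock.lock(); results[i] = chunk; lock.unlock()
    }
    return results
}
```

`concurrentPerform` is a convenient way to get high queue depth without managing threads by hand; the PR's figure of depth 1→24 suggests roughly that many reads in flight per token.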

Speculative Decoding Infrastructure:

  • ModelContainer.swift (+52 lines): DraftModelRef (@unchecked Sendable wrapper), extractDraftModel(), speculative generate convenience method
  • KVCache.swift (+30 lines): MambaCache checkpoint/restore for hybrid Attention+Mamba architectures — enables speculative decoding on Qwen3.5
  • Evaluate.swift (+5 lines): Mamba state checkpointing in SpeculativeTokenIterator.speculateRound()
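A toy model of the MambaCache checkpoint/restore idea (all names here are hypothetical; the real KVCache.swift operates on MLX arrays): snapshot the recurrent state before a speculation round, and restore it if draft tokens are rejected.

```swift
// Sketch of checkpoint-based rollback for a recurrent (Mamba-style)
// state. Unlike an attention KV cache, the state folds each token in
// irreversibly, so the only way back is to restore a saved copy.
struct MambaStateSketch {
    var state: [Float]
    private var checkpoint: [Float]?

    init(state: [Float]) { self.state = state }

    mutating func saveCheckpoint() { checkpoint = state }

    mutating func restoreCheckpoint() {
        if let saved = checkpoint { state = saved }
    }

    // Toy recurrence: decay-and-accumulate, not the actual SSM update.
    mutating func step(token: Float) {
        for i in state.indices { state[i] = 0.9 * state[i] + token }
    }
}
```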

Key findings

  1. GPU compute is the bottleneck at steady state, not I/O. The OS page cache serves ~90% of expert reads from RAM. Per-token GPU compute is ~190ms of ~200ms total.
  2. Don't cache expert weights in application memory. An LRU cache stole from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it.
  3. Speculative decoding is counterproductive for SSD-streaming MoE. The verify pass sends N+1 tokens, each routing to different experts — SSD I/O scales with the union of all positions' expert selections. Works correctly for in-RAM models only.
  4. Mamba state requires checkpoint-based rollback. Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone.
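Finding 3 can be made concrete with a tiny sketch: the verify pass must stream the union of the experts chosen at every draft position, so SSD I/O grows with draft length (expert IDs below are made up for illustration):

```swift
// For a verify pass over several positions, the weights that must be
// streamed are the union of each position's routed expert set.
func expertsToStream(_ perPosition: [[Int]]) -> Set<Int> {
    perPosition.reduce(into: Set<Int>()) { $0.formUnion($1) }
}
```

With top-k=3 and a 4-token speculation round, up to 12 distinct experts' weights may be pulled from SSD for a single verify pass, versus 3 for one standard decode step — which is why the PR restricts speculative decoding to in-RAM models.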

Usage

```shell
# Standard SSD streaming (recommended, top-k=6):
SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
  --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts

# With speculative decoding (in-RAM models only):
SwiftLM --port 8002 \
  --model <path>/Qwen3.5-27B-4bit \
  --draft-model <path>/Qwen3.5-9B-4bit \
  --num-draft-tokens 4
```
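On the library side, the `SWIFTLM_TOP_K` override could be read along these lines (a sketch only; the actual Qwen35.swift wiring may differ, and clamping to the configured value is an assumption):

```swift
import Foundation

// Sketch: allow an environment variable to reduce the number of active
// experts per token below the checkpoint's configured top-k. Values
// outside 1...configured fall back to the configured default.
func activeTopK(configured: Int,
                env: [String: String] = ProcessInfo.processInfo.environment) -> Int {
    if let raw = env["SWIFTLM_TOP_K"],
       let k = Int(raw), k >= 1, k <= configured {
        return k
    }
    return configured
}
```

Passing the environment as a parameter keeps the function deterministic and easy to test.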

Test plan

  • Qwen3.5-122B at top-k=8/6/4/2 with quality verification
  • Memory stable at ~10.6 GB resident, no swap activity
  • Speculative decoding compiles, loads dual models, generates tokens
  • MambaCache checkpoint/restore verified on hybrid Attention+Mamba
  • 9B draft benchmarked (73.65 tok/s), 27B benchmarked (24.58 tok/s)
  • Qwen3.5-397B-A17B-4bit benchmarked (0.56 tok/s, 209GB, 512 experts)

… on Apple Silicon

Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds
speculative decoding infrastructure, achieving 10x generation speedup
for large MoE models on memory-constrained Apple Silicon.

Server changes:
- --draft-model CLI flag to load a second model for speculative decoding
- --num-draft-tokens to configure tokens per speculation round
- Dual-model loading: draft model in RAM + main model in SSD streaming
- Automatic routing to speculative vs standard generation

Dependency: requires SharpAI/mlx-swift-lm feat/ssd-streaming-10x branch
with SSD streaming optimizations + MambaCache rollback.

Tested: Qwen3.5-122B-A10B-4bit 0.58->4.95 tok/s (8.5x) on M1 Ultra 64GB

Fixes SharpAI#24
@solderzzc
Member

Hi @ericjlake, thanks for your PR, this is amazing...
Would you like to check if there's code to be merged to https://github.com/SharpAI/mlx-swift-lm.git ?
error: could not find a branch named ‘feat/ssd-streaming-10x’ in https://github.com/SharpAI/mlx-swift-lm.git

@ericjlake
Contributor Author

ericjlake commented Apr 12, 2026 via email

CI failed because SharpAI/mlx-swift-lm does not have feat/ssd-streaming-10x.
Point at ericjlake/mlx-swift-lm fork which has the branch with all
SSD streaming library changes (SwitchLayers, Qwen35, KVCache, Evaluate).
@ericjlake
Contributor Author

CI fix pushed (dc40fb7) — Package.swift now points at ericjlake/mlx-swift-lm fork where feat/ssd-streaming-10x exists. The previous failure was SPM trying to resolve that branch from SharpAI/mlx-swift-lm which doesn't have it yet.

Run #65 is waiting for workflow approval — once approved it should resolve deps cleanly. The library changes (SwitchLayers, Qwen35, KVCache, Evaluate) are all on the fork branch. Happy to move the dependency to SharpAI/mlx-swift-lm once those changes are pushed there.

@solderzzc
Member

@ericjlake Great, the CI passed. Do you want to PR ericjlake/mlx-swift-lm to https://github.com/SharpAI/mlx-swift-lm.git main branch?

@solderzzc
Member

I have created PR SharpAI/mlx-swift-lm#9
Since this CI passed, let me merge it.

@solderzzc solderzzc merged commit c6e6212 into SharpAI:main Apr 12, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

--stream-experts only achieves ~5% of available SSD bandwidth on M1 Ultra (300 MB/s observed vs 5-7 GB/s capable)
