
feat: 10x SSD expert streaming speedup + speculative decoding for MoE on Apple Silicon#26

Merged
solderzzc merged 2 commits into SharpAI:main from ericjlake:feat/ssd-streaming-10x on Apr 12, 2026
Conversation

@ericjlake
Contributor

Fixes #24

Summary

Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds speculative decoding infrastructure, achieving 10x generation speedup for large MoE models on memory-constrained Apple Silicon. Tested with Qwen3.5-122B-A10B-4bit (69.6 GB) and Qwen3.5-397B-A17B-4bit (209 GB) streaming expert weights from NVMe SSD on a 64GB M1 Ultra.

Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

| Configuration | tok/s | vs original |
| --- | --- | --- |
| Original `--stream-experts` | 0.58 | baseline |
| This PR (top-k=8, full quality) | 4.95 | 8.5x |
| This PR (top-k=6, default) | 5.20 | 9.0x |
| This PR (top-k=4, speed mode) | 5.91 | 10.2x |
| This PR (top-k=2, turbo mode) | 6.52 | 11.2x |

What changed

This PR (SwiftLM Server.swift)

  • --draft-model <path> CLI flag to load a second model for speculative decoding
  • --num-draft-tokens <n> to configure tokens per speculation round (default: 4)
  • Dual-model loading: draft model (e.g., 9B) fully in RAM + main model (e.g., 122B) in SSD streaming mode
  • Automatic routing to speculative vs. standard generation based on whether a draft model is loaded
  • Package.swift updated to pull SharpAI/mlx-swift-lm branch feat/ssd-streaming-10x
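The automatic-routing bullet above can be sketched as a single pure decision function (a minimal sketch: `GenerationConfig`, `GenerationMode`, and `selectMode` are hypothetical names, not the actual Server.swift API):

```swift
import Foundation

// Hypothetical sketch of routing between standard and speculative
// generation, keyed on whether --draft-model was supplied.
struct GenerationConfig {
    var draftModelPath: String?   // set by --draft-model <path>
    var numDraftTokens: Int = 4   // set by --num-draft-tokens <n>
}

enum GenerationMode: Equatable {
    case standard
    case speculative(draftTokens: Int)
}

func selectMode(_ config: GenerationConfig) -> GenerationMode {
    // Speculative decoding only when a draft model is actually loaded.
    guard config.draftModelPath != nil else { return .standard }
    return .speculative(draftTokens: config.numDraftTokens)
}
```

Keeping the decision in one pure function makes the with/without-draft-model behavior trivially testable.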

Companion dependency changes (SharpAI/mlx-swift-lm feat/ssd-streaming-10x)

The core optimizations live in 5 files across mlx-swift-lm. The branch with these changes needs to be created on SharpAI/mlx-swift-lm — full diff available in the NAS repo at /Volumes/Tend Well Life/code/repos/mlx-swift-lm (4 commits on top of main b71fad2).

SSD Streaming Optimizations:

  • SwitchLayers.swift (+240 lines): Cross-projection batching, concurrent pread via DispatchQueue.concurrentPerform (NVMe queue depth 1→24), persistent Metal buffers, asyncEval pipeline with speculative pread (~70% hit rate)
  • Qwen35.swift (+9 lines): Runtime SWIFTLM_TOP_K env var to reduce active experts per token
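The concurrent-pread idea can be illustrated with a minimal sketch (assumptions: a plain file descriptor and fixed-size slices standing in for per-expert weights; the real SwitchLayers.swift additionally batches cross-projections, reuses Metal buffers, and issues speculative reads):

```swift
import Foundation
import Dispatch
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Fan independent positioned reads out over a thread pool so several
// pread(2) calls are in flight at once, raising the NVMe queue depth
// above 1. A lock serializes writes into the shared results array.
func concurrentPread(fd: Int32, offsets: [Int], size: Int) -> [[UInt8]] {
    var results = [[UInt8]](repeating: [], count: offsets.count)
    let lock = NSLock()
    DispatchQueue.concurrentPerform(iterations: offsets.count) { i in
        var buf = [UInt8](repeating: 0, count: size)
        let n = buf.withUnsafeMutableBytes { raw in
            pread(fd, raw.baseAddress, size, off_t(offsets[i]))
        }
        let chunk = Array(buf.prefix(max(n, 0)))
        lock.lock(); results[i] = chunk; lock.unlock()
    }
    return results
}
```

`concurrentPerform` is a convenient way to get high queue depth without managing threads by hand; the PR's figure of depth 1→24 suggests roughly that many reads in flight per token.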

Speculative Decoding Infrastructure:

  • ModelContainer.swift (+52 lines): DraftModelRef (@unchecked Sendable wrapper), extractDraftModel(), speculative generate convenience method
  • KVCache.swift (+30 lines): MambaCache checkpoint/restore for hybrid Attention+Mamba architectures — enables speculative decoding on Qwen3.5
  • Evaluate.swift (+5 lines): Mamba state checkpointing in SpeculativeTokenIterator.speculateRound()
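A toy model of the MambaCache checkpoint/restore idea (all names here are hypothetical; the real KVCache.swift operates on MLX arrays): snapshot the recurrent state before a speculation round, and restore it if draft tokens are rejected.

```swift
// Sketch of checkpoint-based rollback for a recurrent (Mamba-style)
// state. Unlike an attention KV cache, the state folds each token in
// irreversibly, so the only way back is to restore a saved copy.
struct MambaStateSketch {
    var state: [Float]
    private var checkpoint: [Float]?

    init(state: [Float]) { self.state = state }

    mutating func saveCheckpoint() { checkpoint = state }

    mutating func restoreCheckpoint() {
        if let saved = checkpoint { state = saved }
    }

    // Toy recurrence: decay-and-accumulate, not the actual SSM update.
    mutating func step(token: Float) {
        for i in state.indices { state[i] = 0.9 * state[i] + token }
    }
}
```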

Key findings

  1. GPU compute is the bottleneck at steady state, not I/O. The OS page cache serves ~90% of expert reads from RAM. Per-token GPU compute is ~190ms of ~200ms total.
  2. Don't cache expert weights in application memory. An LRU cache stole from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it.
  3. Speculative decoding is counterproductive for SSD-streaming MoE. The verify pass sends N+1 tokens, each routing to different experts — SSD I/O scales with the union of all positions' expert selections. Works correctly for in-RAM models only.
  4. Mamba state requires checkpoint-based rollback. Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone.
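Finding 3 can be made concrete with a tiny sketch: the verify pass must stream the union of the experts chosen at every draft position, so SSD I/O grows with draft length (expert IDs below are made up for illustration):

```swift
// For a verify pass over several positions, the weights that must be
// streamed are the union of each position's routed expert set.
func expertsToStream(_ perPosition: [[Int]]) -> Set<Int> {
    perPosition.reduce(into: Set<Int>()) { $0.formUnion($1) }
}
```

With top-k=3 and a 4-token speculation round, up to 12 distinct experts' weights may be pulled from SSD for a single verify pass, versus 3 for one standard decode step — which is why the PR restricts speculative decoding to in-RAM models.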

Usage

```shell
# Standard SSD streaming (recommended, top-k=6):
SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
  --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts

# With speculative decoding (in-RAM models only):
SwiftLM --port 8002 \
  --model <path>/Qwen3.5-27B-4bit \
  --draft-model <path>/Qwen3.5-9B-4bit \
  --num-draft-tokens 4
```
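On the library side, the `SWIFTLM_TOP_K` override could be read along these lines (a sketch only; the actual Qwen35.swift wiring may differ, and clamping to the configured value is an assumption):

```swift
import Foundation

// Sketch: allow an environment variable to reduce the number of active
// experts per token below the checkpoint's configured top-k. Values
// outside 1...configured fall back to the configured default.
func activeTopK(configured: Int,
                env: [String: String] = ProcessInfo.processInfo.environment) -> Int {
    if let raw = env["SWIFTLM_TOP_K"],
       let k = Int(raw), k >= 1, k <= configured {
        return k
    }
    return configured
}
```

Passing the environment as a parameter keeps the function deterministic and easy to test.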

Test plan

  • Qwen3.5-122B at top-k=8/6/4/2 with quality verification
  • Memory stable at ~10.6 GB resident, no swap activity
  • Speculative decoding compiles, loads dual models, generates tokens
  • MambaCache checkpoint/restore verified on hybrid Attention+Mamba
  • 9B draft benchmarked (73.65 tok/s), 27B benchmarked (24.58 tok/s)
  • Qwen3.5-397B-A17B-4bit benchmarked (0.56 tok/s, 209GB, 512 experts)

… on Apple Silicon

Rewrites the SSD expert streaming pipeline in mlx-swift-lm and adds
speculative decoding infrastructure, achieving 10x generation speedup
for large MoE models on memory-constrained Apple Silicon.

Server changes:
- --draft-model CLI flag to load a second model for speculative decoding
- --num-draft-tokens to configure tokens per speculation round
- Dual-model loading: draft model in RAM + main model in SSD streaming
- Automatic routing to speculative vs standard generation

Dependency: requires SharpAI/mlx-swift-lm feat/ssd-streaming-10x branch
with SSD streaming optimizations + MambaCache rollback.

Tested: Qwen3.5-122B-A10B-4bit 0.58->4.95 tok/s (8.5x) on M1 Ultra 64GB

Fixes SharpAI#24
@solderzzc
Member

Hi @ericjlake, thanks for your PR, this is amazing...
Would you like to check if there's code to be merged to https://github.com/SharpAI/mlx-swift-lm.git ?
error: could not find a branch named ‘feat/ssd-streaming-10x’ in https://github.com/SharpAI/mlx-swift-lm.git

@ericjlake
Contributor Author

ericjlake commented Apr 12, 2026 via email

CI failed because SharpAI/mlx-swift-lm does not have feat/ssd-streaming-10x.
Point at ericjlake/mlx-swift-lm fork which has the branch with all
SSD streaming library changes (SwitchLayers, Qwen35, KVCache, Evaluate).
@ericjlake
Contributor Author

CI fix pushed (dc40fb7) — Package.swift now points at ericjlake/mlx-swift-lm fork where feat/ssd-streaming-10x exists. The previous failure was SPM trying to resolve that branch from SharpAI/mlx-swift-lm which doesn't have it yet.

Run #65 is waiting for workflow approval — once approved it should resolve deps cleanly. The library changes (SwitchLayers, Qwen35, KVCache, Evaluate) are all on the fork branch. Happy to move the dependency to SharpAI/mlx-swift-lm once those changes are pushed there.

@solderzzc
Member

@ericjlake Great, the CI passed. Do you want to PR ericjlake/mlx-swift-lm to https://github.com/SharpAI/mlx-swift-lm.git main branch?

@solderzzc
Member

I have created PR SharpAI/mlx-swift-lm#9
Since this CI passed, let me merge it.

@solderzzc solderzzc merged commit c6e6212 into SharpAI:main Apr 12, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

--stream-experts only achieves ~5% of available SSD bandwidth on M1 Ultra (300 MB/s observed vs 5-7 GB/s capable)
