perf: +8-9% on Qwen3.5-122B-A10B SSD-stream via stacked-buffer + Gate+Up fusion in mlx-swift-lm #90
Merged
solderzzc merged 2 commits into SharpAI:main on Apr 27, 2026
Bumps the mlx-swift and mlx-swift-lm submodule pointers to bring in
two opt-in, env-gated optimizations on the SSD-streaming MoE path.
All flags default OFF; the existing N-buffer SSD-stream path is
unchanged when the env vars are unset. Sibling to #26 (same shape: one
SwiftLM PR with a submodule pointer bump plus linked dependency
branches on a personal fork; same author). Closes the next ~10% of the
headroom #26 left on the table.
## Submodule changes
mlx-swift: 6b27940 -> 761381f (+1 commit)
feat(fast): preadIntoOffset for stacked-buffer MoE consumers + PAPPS try_take
- Adds mlx_fast_pread_into_offset (writes one expert at a byte offset
into a stacked tensor) plus the Swift wrapper MLXFast.preadIntoOffset.
Additive only (+116 lines); the call shape is sketched at the end of
this section.
- Branch: ericjlake/mlx-swift @ feat/preadIntoOffset
mlx-swift-lm: c154080 -> 57ec366 (+13 commits: 10 are upstream-main
drift catching up to b465, 3 are this PR's payload)
- a2883a2 feat(SwitchGLU): MLX_MOE_CACHE_SLOTS env tunable (+15 lines):
a small static reader for MLX_MOE_CACHE_SLOTS=N, used by the new fast
path; no behavior change on its own.
- f432840 feat(SwitchGLU): stacked-buffer SSD-stream fast path +
computeExpertsFused (MLX_MOE_STACKED) (+314 lines): adds
runStackedFastPath and QuantizedSwitchLinear.computeExpertsFused.
- 57ec366 feat(SwitchGLU): Gate+Up SwiGLU matmul fusion
(MLX_MOE_FUSE_GATEUP=1) (+161/-27 lines)
- Branch: ericjlake/mlx-swift-lm @ feat/stacked-moe-fastpath
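A minimal sketch of the call shape the new wrapper enables. Every
parameter name here is illustrative, not the authoritative signature
(which lives in the mlx-swift commit above):

```swift
import MLXFast  // from the bumped mlx-swift pointer

// Illustrative only: stream one expert's quantized weights from disk
// directly into its slot of a pre-allocated stacked tensor, so the
// buffer is reused in place instead of reallocating per expert.
// `fd`, `fileOffset`, `slot`, and `bytesPerExpert` are hypothetical.
try MLXFast.preadIntoOffset(
    fd: fd,                               // open fd of the weight shard on SSD
    fileOffset: fileOffset,               // where this expert starts on disk
    bytes: bytesPerExpert,                // one expert's weight payload
    into: stackedBuffer,                  // [CACHE_SLOTS, intermediate, hidden]
    atByteOffset: slot * bytesPerExpert   // target slot inside the stack
)
```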
## How it works
* MLX_MOE_STACKED=1: allocate ONE [CACHE_SLOTS, intermediate, hidden]
weight buffer per projection per layer; populate slots in place via
MLXFast.preadIntoOffset; issue ONE gatherQuantizedMM per projection
(rhsIndices = slotPerToken) instead of top_k separate dispatches per
projection. Each Metal dispatch carries ~30 µs of CPU->GPU
encode/submit overhead on Apple Silicon, which dominates per-token
compute on SSD-streamed MoE.
* MLX_MOE_FUSE_GATEUP=1 (requires MLX_MOE_STACKED=1): collapse gate and
up into a single combined [CACHE_SLOTS, 2*intermediate, hidden]
buffer; one gatherQuantizedMM produces [..., 2*intermediate], which is
split into halves and fed into silu(g) * u. Saves one projection-level
dispatch per layer per token. (Both paths are sketched below this
list.)
* MLX_MOE_CACHE_SLOTS=N (default 16, min 6): cache-slot tunable used by
the new fast path.
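To make the dispatch accounting concrete: with top-k=6 and 3
projections, the legacy path issues 6 * 3 = 18 gather matmuls per MoE
layer per token; stacked cuts that to 3 and Gate+Up fusion to 2, so at
~30 µs each roughly 16 * 30 µs ≈ 480 µs of encode/submit overhead per
MoE layer per token goes away before any compute-side wins. A minimal
Swift sketch of the fused dispatch follows; gatherQuantizedMM is the
name this PR uses for the gather-QMM entry point, and its exact
signature plus the buffer/scale names are illustrative, not the actual
computeExpertsFused source:

```swift
import MLX
import MLXNN

// Illustrative sketch of the fused Gate+Up dispatch (not the actual
// computeExpertsFused implementation from f432840/57ec366).
func fusedGateUp(
    x: MLXArray,              // [tokens, hidden] decode activations
    stackedGateUp: MLXArray,  // [CACHE_SLOTS, 2*intermediate, hidden], quantized
    scales: MLXArray,         // quantization scales for the stacked buffer
    biases: MLXArray,         // quantization biases for the stacked buffer
    slotPerToken: MLXArray    // rhsIndices: cache slot chosen for each token
) -> MLXArray {
    // ONE gather matmul yields both halves at once: [..., 2*intermediate].
    let gu = gatherQuantizedMM(
        x, stackedGateUp,
        scales: scales, biases: biases,
        rhsIndices: slotPerToken
    )
    // Split into gate and up halves and apply SwiGLU; the down
    // projection then runs as its own (stacked) dispatch as before.
    let halves = split(gu, parts: 2, axis: -1)
    return silu(halves[0]) * halves[1]
}
```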
Eligibility: the stacked path engages only when all 3 projections are
quantized, resolveSSDInfo() succeeds, and idx.size <= 32 (single-token
decode). Ineligible layers and prompt batches return nil from
runStackedFastPath and fall through to the existing path, so there is
no behavior change when both env flags are unset.
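For reviewers, a hedged sketch of the guard-clause shape those rules
imply. Only runStackedFastPath and resolveSSDInfo are names from this
PR; the other helpers are hypothetical stand-ins:

```swift
import Foundation
import MLX

// Illustrative eligibility gate (not the actual SwitchGLU source).
// Returning nil at any guard drops the caller back onto the legacy
// N-buffer SSD-stream path, so ineligible layers pay nothing new.
func runStackedFastPath(idx: MLXArray, x: MLXArray) -> MLXArray? {
    guard ProcessInfo.processInfo.environment["MLX_MOE_STACKED"] == "1"
    else { return nil }                // opt-in flag unset: legacy path
    guard allProjectionsQuantized()    // hypothetical: gate/up/down all quantized
    else { return nil }
    guard let ssd = resolveSSDInfo()   // SSD-streaming metadata must resolve
    else { return nil }
    guard idx.size <= 32               // single-token decode only;
    else { return nil }                // prompt batches fall through
    return computeStackedExperts(idx: idx, x: x, ssd: ssd)  // hypothetical
}
```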
## Bench (Qwen3.5-122B-A10B-4bit, M1 Ultra 64 GB, top-k=6, slots=16)
Matched 600-token prompt, mean of 3 runs per config:

| Config                                       | t/s   | delta vs legacy |
|----------------------------------------------|-------|-----------------|
| upstream baseline (legacy N-buffer)          | ~5.12 | -               |
| MLX_MOE_STACKED=1                            | ~5.36 | +4.6%           |
| MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1      | ~5.64 | +10.2%          |

Variance across prompts is real (different routing patterns ->
different cache hit ratios); per-prompt spread for the same config is
~5%. The numbers above use the same prompt for all three configs; the
original proposal doc has full per-prompt tables. Output quality was
verified coherent on both the stacked and legacy paths.
## How to test

```sh
# On a clean SwiftLM checkout of this branch:
git submodule update --init --recursive
swift build -c release --product SwiftLM
MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1 \
  MLX_MOE_CACHE_SLOTS=16 SWIFTLM_TOP_K=6 \
  .build/arm64-apple-macosx/release/SwiftLM \
  --model <path>/Qwen3.5-122B-A10B-4bit \
  --port 8002 --stream-experts
```

With the env flags unset (or set to 0), behavior is identical to
upstream main.
## Maintainer note
The submodule pointers reference commits currently hosted on the
ericjlake/mlx-swift and ericjlake/mlx-swift-lm forks, while .gitmodules
still points at the canonical SharpAI URLs (matching the repo layout).
For a clean merge with canonical-resolving submodules, the simplest
sequence is:

1. Land mlx-swift commit 761381f on SharpAI/mlx-swift (additive,
   +116 lines; a fast-forward on main or a feature branch, your call).
2. Land mlx-swift-lm commits a2883a2 -> f432840 -> 57ec366 on
   SharpAI/mlx-swift-lm, on top of main.
3. Re-run `git submodule update` here so the pointers resolve from the
   canonical SharpAI URLs.

If you prefer a different sequence (e.g., a maintainer-side
cherry-pick or rebase), happy to adjust.
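For concreteness, one way to run that sequence from the maintainer
side (clone locations are placeholders; any of cherry-pick, merge, or
fast-forward works for step 1):

```sh
# In a clone of SharpAI/mlx-swift: fetch and land the single additive commit.
git fetch https://github.com/ericjlake/mlx-swift feat/preadIntoOffset
git cherry-pick 761381f

# In a clone of SharpAI/mlx-swift-lm: land the 3 payload commits in order.
git fetch https://github.com/ericjlake/mlx-swift-lm feat/stacked-moe-fastpath
git cherry-pick a2883a2 f432840 57ec366

# Back in the SwiftLM superproject: re-resolve pointers from canonical URLs.
git submodule sync --recursive
git submodule update --init --recursive
```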
## Test plan
- swift build -c release --product SwiftLM against the new submodule
  pointers.
- Benched +10.2% over legacy on a 600-tok matched prompt
  (5.12 -> 5.64 t/s).
- Verified the legacy path (MLX_MOE_STACKED=0) still produces coherent
  output.
- Reviewed the SwitchLayers.swift diff: no leaked telemetry, pinning,
  PAPPS, prints, fatalError, internal task IDs, etc.
## References
- Proposal + iteration log: docs/llm/122b_speedup_proposals.md (in the
  NAS repo, private; happy to mirror anywhere on request).
- Related full-RAM work: #84 (needsMoeFlush gate + prompt-cache bleed
  fixes), #34 (Update mlx-swift-lm Dependency to b413), #85
  (fix(server): prompt-cache bleed fixes — MambaCache gate + ndim
  guard + spec-decode ordering); this PR sits on the orthogonal
  SSD-streaming axis.
Follow-up: pins `mlx-swift` and `mlx-swift-lm` to their respective
latest `main` commits, pulling in the newly merged stacked-buffer MoE
optimizations for SwiftLM.
- SharpAI/mlx-swift#10
- SharpAI/mlx-swift-lm#35
docs/llm/122b_speedup_proposals.mdin the NAS repo (private). Happy to mirror anywhere on request.needsMoeFlushgate + prompt-cache bleed fixes #84 / Update mlx-swift-lm Dependency to b413 #34 / fix(server): prompt-cache bleed fixes — MambaCache gate + ndim guard + spec-decode ordering #85 (full-RAMneedsMoeFlushgate), but on the orthogonal SSD-streaming axis.