perf: +8-9% on Qwen3.5-122B-A10B SSD-stream via stacked-buffer + Gate+Up fusion in mlx-swift-lm #90
Merged
solderzzc merged 2 commits into SharpAI:main on Apr 27, 2026
Bumps the mlx-swift and mlx-swift-lm submodule pointers to bring in
two opt-in, env-gated optimizations on the SSD-streaming MoE path.
All flags default OFF; the existing N-buffer SSD-stream path is
unchanged when the env vars are unset. Sibling to #26 (same shape: one
SwiftLM PR with a submodule pointer bump plus linked dependency
branches on a personal fork; same author). Closes the next ~10% of the
headroom #26 left on the table.
## Submodule changes
mlx-swift: 6b27940 -> 761381f (+1 commit)
feat(fast): preadIntoOffset for stacked-buffer MoE consumers + PAPPS try_take
- Adds mlx_fast_pread_into_offset (writes one expert at a byte offset
into a stacked tensor) plus the Swift wrapper MLXFast.preadIntoOffset.
Additive only (+116 lines); the call shape is sketched at the end of
this section.
- Branch: ericjlake/mlx-swift @ feat/preadIntoOffset
mlx-swift-lm: c154080 -> 57ec366 (+13 commits: 10 are upstream-main
drift catching up to b465, 3 are this PR's payload)
- a2883a2 feat(SwitchGLU): MLX_MOE_CACHE_SLOTS env tunable (+15 lines):
a small static reader for MLX_MOE_CACHE_SLOTS=N, used by the new fast
path; no behavior change on its own.
- f432840 feat(SwitchGLU): stacked-buffer SSD-stream fast path +
computeExpertsFused (MLX_MOE_STACKED) (+314 lines): adds
runStackedFastPath and QuantizedSwitchLinear.computeExpertsFused.
- 57ec366 feat(SwitchGLU): Gate+Up SwiGLU matmul fusion
(MLX_MOE_FUSE_GATEUP=1) (+161/-27 lines)
- Branch: ericjlake/mlx-swift-lm @ feat/stacked-moe-fastpath
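A minimal sketch of the call shape the new wrapper enables. Every
parameter name here is illustrative, not the authoritative signature
(which lives in the mlx-swift commit above):

```swift
import MLXFast  // from the bumped mlx-swift pointer

// Illustrative only: stream one expert's quantized weights from disk
// directly into its slot of a pre-allocated stacked tensor, so the
// buffer is reused in place instead of reallocating per expert.
// `fd`, `fileOffset`, `slot`, and `bytesPerExpert` are hypothetical.
try MLXFast.preadIntoOffset(
    fd: fd,                               // open fd of the weight shard on SSD
    fileOffset: fileOffset,               // where this expert starts on disk
    bytes: bytesPerExpert,                // one expert's weight payload
    into: stackedBuffer,                  // [CACHE_SLOTS, intermediate, hidden]
    atByteOffset: slot * bytesPerExpert   // target slot inside the stack
)
```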
## How it works
* MLX_MOE_STACKED=1: allocate ONE [CACHE_SLOTS, intermediate, hidden]
weight buffer per projection per layer; populate slots in place via
MLXFast.preadIntoOffset; issue ONE gatherQuantizedMM per projection
(rhsIndices = slotPerToken) instead of top_k separate dispatches per
projection. Each Metal dispatch carries ~30 µs of CPU->GPU
encode/submit overhead on Apple Silicon, which dominates per-token
compute on SSD-streamed MoE.
* MLX_MOE_FUSE_GATEUP=1 (requires MLX_MOE_STACKED=1): collapse gate and
up into a single combined [CACHE_SLOTS, 2*intermediate, hidden]
buffer; one gatherQuantizedMM produces [..., 2*intermediate], which is
split into halves and fed into silu(g) * u. Saves one projection-level
dispatch per layer per token. (Both paths are sketched below this
list.)
* MLX_MOE_CACHE_SLOTS=N (default 16, min 6): cache-slot tunable used by
the new fast path.
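To make the dispatch accounting concrete: with top-k=6 and 3
projections, the legacy path issues 6 * 3 = 18 gather matmuls per MoE
layer per token; stacked cuts that to 3 and Gate+Up fusion to 2, so at
~30 µs each roughly 16 * 30 µs ≈ 480 µs of encode/submit overhead per
MoE layer per token goes away before any compute-side wins. A minimal
Swift sketch of the fused dispatch follows; gatherQuantizedMM is the
name this PR uses for the gather-QMM entry point, and its exact
signature plus the buffer/scale names are illustrative, not the actual
computeExpertsFused source:

```swift
import MLX
import MLXNN

// Illustrative sketch of the fused Gate+Up dispatch (not the actual
// computeExpertsFused implementation from f432840/57ec366).
func fusedGateUp(
    x: MLXArray,              // [tokens, hidden] decode activations
    stackedGateUp: MLXArray,  // [CACHE_SLOTS, 2*intermediate, hidden], quantized
    scales: MLXArray,         // quantization scales for the stacked buffer
    biases: MLXArray,         // quantization biases for the stacked buffer
    slotPerToken: MLXArray    // rhsIndices: cache slot chosen for each token
) -> MLXArray {
    // ONE gather matmul yields both halves at once: [..., 2*intermediate].
    let gu = gatherQuantizedMM(
        x, stackedGateUp,
        scales: scales, biases: biases,
        rhsIndices: slotPerToken
    )
    // Split into gate and up halves and apply SwiGLU; the down
    // projection then runs as its own (stacked) dispatch as before.
    let halves = split(gu, parts: 2, axis: -1)
    return silu(halves[0]) * halves[1]
}
```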
Eligibility: the stacked path engages only when all 3 projections are
quantized, resolveSSDInfo() succeeds, and idx.size <= 32 (single-token
decode). Ineligible layers and prompt batches return nil from
runStackedFastPath and fall through to the existing path, so there is
no behavior change when both env flags are unset.
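For reviewers, a hedged sketch of the guard-clause shape those rules
imply. Only runStackedFastPath and resolveSSDInfo are names from this
PR; the other helpers are hypothetical stand-ins:

```swift
import Foundation
import MLX

// Illustrative eligibility gate (not the actual SwitchGLU source).
// Returning nil at any guard drops the caller back onto the legacy
// N-buffer SSD-stream path, so ineligible layers pay nothing new.
func runStackedFastPath(idx: MLXArray, x: MLXArray) -> MLXArray? {
    guard ProcessInfo.processInfo.environment["MLX_MOE_STACKED"] == "1"
    else { return nil }                // opt-in flag unset: legacy path
    guard allProjectionsQuantized()    // hypothetical: gate/up/down all quantized
    else { return nil }
    guard let ssd = resolveSSDInfo()   // SSD-streaming metadata must resolve
    else { return nil }
    guard idx.size <= 32               // single-token decode only;
    else { return nil }                // prompt batches fall through
    return computeStackedExperts(idx: idx, x: x, ssd: ssd)  // hypothetical
}
```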
## Bench (Qwen3.5-122B-A10B-4bit, M1 Ultra 64 GB, top-k=6, slots=16)
Matched 600-token prompt, mean of 3 runs per config:

| Config                                       | t/s   | delta vs legacy |
|----------------------------------------------|-------|-----------------|
| upstream baseline (legacy N-buffer)          | ~5.12 | -               |
| MLX_MOE_STACKED=1                            | ~5.36 | +4.6%           |
| MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1      | ~5.64 | +10.2%          |

Variance across prompts is real (different routing patterns ->
different cache hit ratios); per-prompt spread for the same config is
~5%. The numbers above use the same prompt for all three configs; the
original proposal doc has full per-prompt tables. Output quality was
verified coherent on both the stacked and legacy paths.
## How to test

```sh
# On a clean SwiftLM checkout of this branch:
git submodule update --init --recursive
swift build -c release --product SwiftLM
MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1 \
  MLX_MOE_CACHE_SLOTS=16 SWIFTLM_TOP_K=6 \
  .build/arm64-apple-macosx/release/SwiftLM \
  --model <path>/Qwen3.5-122B-A10B-4bit \
  --port 8002 --stream-experts
```

With the env flags unset (or set to 0), behavior is identical to
upstream main.
## Maintainer note
The submodule pointers reference commits currently hosted on the
ericjlake/mlx-swift and ericjlake/mlx-swift-lm forks, while .gitmodules
still points at the canonical SharpAI URLs (matching the repo layout).
For a clean merge with canonical-resolving submodules, the simplest
sequence is:

1. Land mlx-swift commit 761381f on SharpAI/mlx-swift (additive,
   +116 lines; a fast-forward on main or a feature branch, your call).
2. Land mlx-swift-lm commits a2883a2 -> f432840 -> 57ec366 on
   SharpAI/mlx-swift-lm, on top of main.
3. Re-run `git submodule update` here so the pointers resolve from the
   canonical SharpAI URLs.

If you prefer a different sequence (e.g., a maintainer-side
cherry-pick or rebase), happy to adjust.
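For concreteness, one way to run that sequence from the maintainer
side (clone locations are placeholders; any of cherry-pick, merge, or
fast-forward works for step 1):

```sh
# In a clone of SharpAI/mlx-swift: fetch and land the single additive commit.
git fetch https://github.com/ericjlake/mlx-swift feat/preadIntoOffset
git cherry-pick 761381f

# In a clone of SharpAI/mlx-swift-lm: land the 3 payload commits in order.
git fetch https://github.com/ericjlake/mlx-swift-lm feat/stacked-moe-fastpath
git cherry-pick a2883a2 f432840 57ec366

# Back in the SwiftLM superproject: re-resolve pointers from canonical URLs.
git submodule sync --recursive
git submodule update --init --recursive
```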
## Test plan
- swift build -c release --product SwiftLM against the new submodule
  pointers.
- Benched +10.2% over legacy on a 600-tok matched prompt
  (5.12 -> 5.64 t/s).
- Verified the legacy path (MLX_MOE_STACKED=0) still produces coherent
  output.
- Reviewed the SwitchLayers.swift diff: no leaked telemetry, pinning,
  PAPPS, prints, fatalError, internal task IDs, etc.
## References
- Proposal + iteration log: docs/llm/122b_speedup_proposals.md (in the
  NAS repo, private; happy to mirror anywhere on request).
- Related full-RAM work: #84 (needsMoeFlush gate + prompt-cache bleed
  fixes), #34 (Update mlx-swift-lm Dependency to b413), #85
  (fix(server): prompt-cache bleed fixes — MambaCache gate + ndim
  guard + spec-decode ordering); this PR sits on the orthogonal
  SSD-streaming axis.
Follow-up: pins `mlx-swift` and `mlx-swift-lm` to their respective
latest `main` commits, pulling in the newly merged stacked-buffer MoE
optimizations for SwiftLM.
- SharpAI/mlx-swift#10
- SharpAI/mlx-swift-lm#35
docs/llm/122b_speedup_proposals.mdin the NAS repo (private). Happy to mirror anywhere on request.needsMoeFlushgate + prompt-cache bleed fixes #84 / Update mlx-swift-lm Dependency to b413 #34 / fix(server): prompt-cache bleed fixes — MambaCache gate + ndim guard + spec-decode ordering #85 (full-RAMneedsMoeFlushgate), but on the orthogonal SSD-streaming axis.