perf: +8-9% on Qwen3.5-122B-A10B SSD-stream via stacked-buffer + Gate+Up fusion in mlx-swift-lm #90

Merged
solderzzc merged 2 commits into SharpAI:main from ericjlake:feat/stacked-moe-122b-perf on Apr 27, 2026

Conversation

@ericjlake
Contributor

Bumps the mlx-swift and mlx-swift-lm submodule pointers to bring in two opt-in env-gated optimizations on the SSD-streaming MoE path. All flags default OFF; existing N-buffer SSD-stream path is unchanged when env vars are unset.

Sibling to #26: same shape (a single SwiftLM PR with a submodule pointer bump plus linked dep branches on a personal fork), same author. Recovers the next ~10% of the headroom #26 left on the table.

Submodule changes

mlx-swift: 6b27940 → 761381f (+1 commit), branch ericjlake/mlx-swift @ feat/preadIntoOffset (diff)

  • feat(fast): preadIntoOffset for stacked-buffer MoE consumers + PAPPS try_take — adds mlx_fast_pread_into_offset (writes one expert at a byte offset into a stacked tensor) plus the Swift wrapper MLXFast.preadIntoOffset. Additive only (+116 lines).

mlx-swift-lm: c154080 → 57ec366 (+13 commits: 10 are upstream-main drift catching up to b465, 3 are this PR's payload), branch ericjlake/mlx-swift-lm @ feat/stacked-moe-fastpath (diff vs upstream main)

The 3 PR-payload commits on top of SharpAI/mlx-swift-lm@main (40d6b67, b465):

  1. a2883a2 feat(SwitchGLU): MLX_MOE_CACHE_SLOTS env tunable (+15 lines): a small static reader for MLX_MOE_CACHE_SLOTS=N, used by the new fast path (see the sketch after this list). No behavior change on its own.
  2. f432840 feat(SwitchGLU): stacked-buffer SSD-stream fast path + computeExpertsFused (MLX_MOE_STACKED) (+314 lines): MLX_MOE_STACKED=1 opts in; adds runStackedFastPath and QuantizedSwitchLinear.computeExpertsFused. Returns nil for any ineligibility so the caller falls through to the existing N-buffer path.
  3. 57ec366 feat(SwitchGLU): Gate+Up SwiGLU matmul fusion (MLX_MOE_FUSE_GATEUP=1) (+161/-27 lines): combines gate and up into one stacked buffer; a single gatherQuantizedMM produces [..., 2*intermediate], which is split into halves and fed into silu(g) * u.
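
For concreteness, a static env reader in the shape commit 1 describes could look like the sketch below. The enum and property names are invented here for illustration; only the variable names, the default of 16, and the floor of 6 come from this PR:

```swift
import Foundation

// Hypothetical sketch, not the branch's actual code.
enum MoEEnv {
    /// MLX_MOE_CACHE_SLOTS=N (default 16, clamped to a minimum of 6).
    static let cacheSlots: Int = {
        let raw = ProcessInfo.processInfo.environment["MLX_MOE_CACHE_SLOTS"]
        return max(6, raw.flatMap { Int($0) } ?? 16)
    }()

    /// MLX_MOE_STACKED=1 opts in to the stacked-buffer fast path.
    static let stacked =
        ProcessInfo.processInfo.environment["MLX_MOE_STACKED"] == "1"

    /// MLX_MOE_FUSE_GATEUP=1 additionally fuses gate+up; requires `stacked`.
    static let fuseGateUp = stacked &&
        ProcessInfo.processInfo.environment["MLX_MOE_FUSE_GATEUP"] == "1"
}
```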

How it works

  • MLX_MOE_STACKED=1: allocate ONE [CACHE_SLOTS, intermediate, hidden] weight buffer per projection per layer; populate slots in place via MLXFast.preadIntoOffset; issue ONE gatherQuantizedMM per projection (rhsIndices = slotPerToken) instead of top_k separate dispatches per projection. Each Metal dispatch carries ~30 µs of CPU→GPU encode/submit overhead on Apple Silicon, which dominates per-token compute on SSD-streamed MoE; at top-k=6 this drops 6 × 3 = 18 expert dispatches per layer per token to 3 (see the sketch after this list).
  • MLX_MOE_FUSE_GATEUP=1 (requires MLX_MOE_STACKED=1): collapses gate and up into one combined [CACHE_SLOTS, 2*intermediate, hidden] buffer; one gatherQuantizedMM produces [..., 2*intermediate], which is split into halves and fed into silu(g) * u. Saves one projection-level dispatch per layer per token (2 instead of 3).
  • MLX_MOE_CACHE_SLOTS=N (default 16, min 6) — cache-slot tunable used by the new fast path.
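
Below is a minimal, self-contained sketch of the dispatch shape the first two bullets describe. Tensor, preadIntoOffset, gatherQuantizedMM, silu, and the split/mul helpers are local stand-ins for the MLX types and calls this PR names; every signature here is an assumption, not the linked branches' actual API:

```swift
import Foundation

struct Tensor {}  // stand-in for MLXArray

// Stand-in for MLXFast.preadIntoOffset: pread()s one expert's bytes from the
// weights file and writes them at `byteOffset` into the stacked buffer.
func preadIntoOffset(_ dst: inout Tensor, fd: Int32,
                     fileOffset: Int64, count: Int, byteOffset: Int) {}

// Stand-in for gatherQuantizedMM: one quantized gather-matmul where
// rhsIndices selects a cache slot (expert) per token.
func gatherQuantizedMM(_ x: Tensor, _ stacked: Tensor,
                       rhsIndices: [Int32]) -> Tensor { x }

func silu(_ t: Tensor) -> Tensor { t }          // stand-in activation
func split2(_ t: Tensor) -> (Tensor, Tensor) { (t, t) }  // halves on last axis
func mul(_ a: Tensor, _ b: Tensor) -> Tensor { a }       // elementwise product

// One projection on the stacked fast path: fill missing slots from SSD,
// then issue ONE Metal dispatch instead of top_k separate expert matmuls.
func stackedProjection(_ x: Tensor, stack: inout Tensor,
                       missing: [(expert: Int, slot: Int)],
                       slotPerToken: [Int32], fd: Int32,
                       bytesPerExpert: Int,
                       fileOffsetOf: (Int) -> Int64) -> Tensor {
    for (expert, slot) in missing {
        preadIntoOffset(&stack, fd: fd,
                        fileOffset: fileOffsetOf(expert),
                        count: bytesPerExpert,
                        byteOffset: slot * bytesPerExpert)
    }
    return gatherQuantizedMM(x, stack, rhsIndices: slotPerToken)
}

// With MLX_MOE_FUSE_GATEUP=1 the stack is [slots, 2*intermediate, hidden];
// the single output is split into halves and fed into silu(g) * u.
func fusedGateUp(_ x: Tensor, gateUpStack: inout Tensor,
                 missing: [(expert: Int, slot: Int)],
                 slotPerToken: [Int32], fd: Int32,
                 bytesPerExpert: Int,
                 fileOffsetOf: (Int) -> Int64) -> Tensor {
    let out = stackedProjection(x, stack: &gateUpStack, missing: missing,
                                slotPerToken: slotPerToken, fd: fd,
                                bytesPerExpert: bytesPerExpert,
                                fileOffsetOf: fileOffsetOf)
    let (g, u) = split2(out)   // halves of [..., 2*intermediate]
    return mul(silu(g), u)
}
```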

Eligibility: the stacked path engages only when all 3 projections are quantized + resolveSSDInfo() succeeds + idx.size <= 32 (single-token decode). Ineligible layers and prompt batches return nil from runStackedFastPath and fall through to the existing path. There is no behavior change when both env flags are unset.
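
As a sketch of that nil fall-through contract (all names are illustrative stand-ins for the real runStackedFastPath in commit f432840, not its actual code):

```swift
struct SSDInfo {}                        // stand-in for the resolved SSD layout
struct StackedResult {}                  // stand-in for the fast path's output

func resolveSSDInfo() -> SSDInfo? { nil }

// Any ineligibility returns nil; the caller then runs the unchanged
// N-buffer path, so behavior is identical whenever the guard fails.
func runStackedFastPath(idxSize: Int,
                        allProjectionsQuantized: Bool) -> StackedResult? {
    guard allProjectionsQuantized,       // all 3 projections quantized
          resolveSSDInfo() != nil,       // resolveSSDInfo() succeeded
          idxSize <= 32                  // single-token decode only
    else { return nil }                  // prompt batches etc. fall through
    return StackedResult()               // ... stacked compute would go here
}
```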

Bench (Qwen3.5-122B-A10B-4bit, M1 Ultra 64 GB, top-k=6, slots=16)

Matched 600-token prompt, mean of 3 runs each:

| Config                                  | t/s                      | Δ vs legacy |
|-----------------------------------------|--------------------------|-------------|
| upstream baseline (legacy N-buffer)     | ~5.12                    | –           |
| MLX_MOE_STACKED=1 only                  | ~5.92 (per proposal doc) | +4.6%       |
| MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1 | ~5.64                    | +10.2%      |
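
For reference, Δ is throughput relative to the legacy baseline: 5.64 / 5.12 − 1 ≈ +10.2% for the fused config. The stacked-only row quotes the proposal doc's throughput, so its Δ may not be computed against the 5.12 baseline in this table.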

Variance across prompts is real (different routing patterns → different cache-hit ratios); per-prompt spread for the same config is ~5%. The numbers above are means on the same prompt for all three configs; the proposal doc has full per-prompt tables.

Output quality verified coherent for both stacked and legacy paths.

How to test

git fetch origin pull/<this-PR>/head:test-stacked-moe
git checkout test-stacked-moe
git submodule update --init --recursive
swift build -c release --product SwiftLM --build-path /tmp/SwiftLM_build_test
MLX_MOE_STACKED=1 MLX_MOE_FUSE_GATEUP=1 \
  MLX_MOE_CACHE_SLOTS=16 SWIFTLM_TOP_K=6 \
  /tmp/SwiftLM_build_test/arm64-apple-macosx/release/SwiftLM \
    --model <path>/Qwen3.5-122B-A10B-4bit \
    --port 8002 --stream-experts

With env flags unset (or set to 0), behavior is identical to upstream main.

Maintainer note

The submodule pointers reference commits currently hosted on ericjlake/mlx-swift and ericjlake/mlx-swift-lm. .gitmodules still points to the canonical SharpAI/* URLs (matching the repo layout). To get a clean merge with canonical-resolving submodules, the simplest sequence is:

  1. Push my mlx-swift commit 761381f to SharpAI/mlx-swift (additive +116 lines, can be a fast-forward on main or a feature branch — your call).
  2. Push my mlx-swift-lm commits a2883a2 → f432840 → 57ec366 to SharpAI/mlx-swift-lm (on top of main).
  3. Merge this SwiftLM PR — git submodule update then resolves from the canonical URLs.

Or if you prefer a different sequence (e.g., maintainer-side cherry-pick or rebase), happy to adjust.

Test plan

  • Build succeeds with swift build -c release --product SwiftLM against the new submodule pointers.
  • Bench: +10.2% over legacy on a 600-tok matched prompt (5.12 → 5.64 t/s).
  • Legacy path smoke test (MLX_MOE_STACKED=0) still produces coherent output.
  • Auditor blocklist scan on SwitchLayers.swift (no leaked telemetry, pinning, PAPPS, prints, fatalError, internal task IDs, etc.).

References

Proposal + iteration log: docs/llm/122b_speedup_proposals.md (in the NAS repo; happy to mirror anywhere on request).

Commits

perf: +8-9% on Qwen3.5-122B-A10B SSD-stream via stacked-buffer + Gate+Up fusion in mlx-swift-lm (the full commit message mirrors the PR description above)

Pins `mlx-swift` and `mlx-swift-lm` to their respective latest `main` commits, pulling in the newly merged stacked-buffer MoE optimizations for SwiftLM:

- SharpAI/mlx-swift#10
- SharpAI/mlx-swift-lm#35
solderzzc merged commit 0ceaf20 into SharpAI:main on Apr 27, 2026
11 checks passed