
[research] ml-explore/mlx-lm — Apple MLX inference backend, up to 87% faster than Ollama on M-series #73

@jpleva91

Description


What It Does

mlx-lm is Apple's official Python library for running and fine-tuning LLMs directly on Apple Silicon via the MLX framework. It runs on the GPU through Metal and exploits the unified memory architecture for zero-copy transfers between CPU and GPU. Features: native 4-bit quantization (a 3-4x RAM reduction versus fp16), a multi-turn KV cache with context reuse, speculative decoding, LoRA fine-tuning, and streaming inference. It supports thousands of Hugging Face models. A recent benchmark (vllm-mlx, 2026) reports up to 87% higher throughput than llama.cpp on the same Mac hardware. Install with pip install mlx-lm; run with mlx_lm.generate --model mlx-community/Qwen3-8B-4bit.
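The quoted 3-4x RAM figure follows from bits-per-weight arithmetic. A back-of-envelope sketch (the helper, the 8B parameter count, and the ~4.5 effective bits per weight are illustrative assumptions, not numbers from mlx-lm):

```go
package main

import "fmt"

// gb estimates weights-only memory in GiB for a model with n parameters
// stored at bpw bits per weight. Group-quantized 4-bit formats carry
// per-group scales, so effective bpw is closer to ~4.5 than 4.0.
// KV cache and activations come on top of this.
func gb(n, bpw float64) float64 {
	return n * bpw / 8 / (1 << 30)
}

func main() {
	const params = 8e9 // e.g. an 8B-parameter model
	fp16 := gb(params, 16)
	q4 := gb(params, 4.5)
	fmt.Printf("fp16: %.1f GiB, 4-bit: %.1f GiB, ratio: %.1fx\n", fp16, q4, fp16/q4)
	// → fp16: 14.9 GiB, 4-bit: 4.2 GiB, ratio: 3.6x
}
```

The ratio lands inside the 3-4x range claimed above; the exact figure depends on the quantization group size.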

Why It Matters for ShellForge

ShellForge currently ties inference exclusively to Ollama. On M-series Macs, mlx-lm could double or triple token throughput for the same model — directly increasing how many agents ShellForge can run in parallel on a given RAM budget. The 4-bit quantization reduces Qwen3-30B from ~19GB to ~8-10GB, potentially fitting a 30B model on an M4 32GB machine (currently borderline). A --backend mlx flag in shellforge run and shellforge serve would let power users opt into the MLX inference path, while keeping Ollama as the default for simplicity.
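The opt-in flag described above could resolve like this — a minimal sketch, assuming a string-valued --backend flag; the function name and backend identifiers are hypothetical, not ShellForge's actual API:

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveBackend maps the proposed --backend flag value to a backend name.
// Ollama stays the default; MLX is only accepted on macOS, since the MLX
// framework requires Apple Silicon.
func resolveBackend(flag string) (string, error) {
	switch flag {
	case "", "ollama":
		return "ollama", nil
	case "mlx":
		if runtime.GOOS != "darwin" {
			return "", fmt.Errorf("mlx backend requires macOS, got %s", runtime.GOOS)
		}
		return "mlx", nil
	default:
		return "", fmt.Errorf("unknown backend %q", flag)
	}
}

func main() {
	b, _ := resolveBackend("")
	fmt.Println("default backend:", b)
	// → default backend: ollama
}
```

Keeping the empty-string case mapped to "ollama" preserves today's behavior for every existing invocation of shellforge run and shellforge serve.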

Integration point: a new internal/inference/mlx driver implementing the same InferenceBackend interface as the Ollama driver. Wire it into shellforge status for backend health reporting.


Integration Effort

Heavy — requires spawning mlx_lm as a subprocess or HTTP server alongside Ollama, mapping model names between ecosystems, and handling macOS-only availability gracefully. Suggest shipping as an opt-in experimental backend.
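The two mechanical pieces of that effort — model-name mapping and spawning mlx_lm as a sidecar — could look like this sketch. The mapping entries and port are illustrative; mlx-lm does ship an OpenAI-compatible mlx_lm.server command, but a real mapping table would be configured rather than hard-coded:

```go
package main

import (
	"fmt"
	"os/exec"
)

// ollamaToMLX maps Ollama model tags to mlx-community Hugging Face repos.
// Entries are illustrative examples only.
var ollamaToMLX = map[string]string{
	"qwen3:8b": "mlx-community/Qwen3-8B-4bit",
}

// mlxServerCmd builds (but does not start) the command that would run
// mlx_lm's HTTP server as a sidecar process. The port is arbitrary here.
func mlxServerCmd(ollamaTag string) (*exec.Cmd, error) {
	repo, ok := ollamaToMLX[ollamaTag]
	if !ok {
		return nil, fmt.Errorf("no MLX mapping for %q", ollamaTag)
	}
	return exec.Command("mlx_lm.server", "--model", repo, "--port", "8765"), nil
}

func main() {
	cmd, err := mlxServerCmd("qwen3:8b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would run:", cmd.Args)
}
```

Returning an error for unmapped tags (instead of guessing a repo name) keeps the experimental backend's failure mode explicit, which fits the opt-in framing above.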

Metadata

Labels: P3 (Low priority / nice to have), enhancement (New feature or request)
