Description
What It Does
mlx-lm is Apple's official Python library for running and fine-tuning LLMs directly on Apple Silicon via the MLX framework. It runs on the GPU through Metal and exploits Apple Silicon's unified memory architecture, so tensors move between CPU and GPU without copies. Features include native 4-bit quantization (roughly 3-4x RAM reduction versus fp16), a multi-turn KV cache with context reuse, speculative decoding, LoRA fine-tuning, and streaming inference, and it supports thousands of Hugging Face models. A recent benchmark (vllm-mlx, 2026) reports up to 87% higher throughput than llama.cpp on the same Mac hardware. Install with `pip install mlx-lm`, then run e.g. `mlx_lm.generate --model mlx-community/Qwen3-8B-4bit`.
Why It Matters for ShellForge
ShellForge currently ties inference exclusively to Ollama. On M-series Macs, mlx-lm could double or triple token throughput for the same model — directly increasing how many agents ShellForge can run in parallel on a given RAM budget. The 4-bit quantization reduces Qwen3-30B from ~19GB to ~8-10GB, potentially fitting a 30B model on an M4 32GB machine (currently borderline). A --backend mlx flag in shellforge run and shellforge serve would let power users opt into the MLX inference path, while keeping Ollama as the default for simplicity.
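The proposed `--backend mlx` flag could resolve like the sketch below: default to Ollama, and reject MLX on non-macOS hosts. All names (`Backend`, `SelectBackend`, the flag values) are hypothetical, since ShellForge's actual CLI surface is not defined in this issue.

```go
package main

import (
	"fmt"
	"runtime"
)

// Backend identifies which inference engine serves a run.
// These values are hypothetical; ShellForge's real flag values may differ.
type Backend string

const (
	BackendOllama Backend = "ollama" // default, cross-platform
	BackendMLX    Backend = "mlx"    // opt-in, Apple Silicon only
)

// SelectBackend resolves the --backend flag value, falling back to
// Ollama when the flag is unset and rejecting MLX on non-macOS hosts.
func SelectBackend(flag string, goos string) (Backend, error) {
	switch flag {
	case "", "ollama":
		return BackendOllama, nil
	case "mlx":
		if goos != "darwin" {
			return "", fmt.Errorf("mlx backend requires macOS, got %s", goos)
		}
		return BackendMLX, nil
	default:
		return "", fmt.Errorf("unknown backend %q", flag)
	}
}

func main() {
	b, err := SelectBackend("", runtime.GOOS)
	fmt.Println(b, err)
}
```

Keeping Ollama as the zero-flag default preserves current behavior for existing users while making the MLX path a deliberate opt-in.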
Integration point: New internal/inference/mlx driver implementing the same InferenceBackend interface as the Ollama driver. Wire into shellforge status for backend health reporting.
Links
- GitHub: https://github.com/ml-explore/mlx-lm
- Stars: 4,236 (created 2025-03-11, actively maintained by Apple)
- License: MIT ✅
- Language: Python
Integration Effort
Heavy — requires spawning mlx_lm as a subprocess or HTTP server alongside Ollama, mapping model names between ecosystems, and handling macOS-only availability gracefully. Suggest shipping as an opt-in experimental backend.
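The model-name mapping could start as a small curated table, as in this sketch. The Ollama tags and mlx-community repo names below are illustrative guesses, not a verified mapping; a real integration would need a maintained table and a fallback path.

```go
package main

import (
	"fmt"
	"strings"
)

// mlxModels maps Ollama-style model tags to plausible mlx-community
// repos on Hugging Face. Entries are illustrative, not canonical.
var mlxModels = map[string]string{
	"qwen3:8b":  "mlx-community/Qwen3-8B-4bit",
	"qwen3:30b": "mlx-community/Qwen3-30B-A3B-4bit",
}

// MLXModelFor returns the MLX repo for an Ollama tag, or an error so
// callers can fall back to the Ollama backend when no mapping exists.
func MLXModelFor(ollamaTag string) (string, error) {
	if repo, ok := mlxModels[strings.ToLower(ollamaTag)]; ok {
		return repo, nil
	}
	return "", fmt.Errorf("no MLX equivalent known for %q", ollamaTag)
}

func main() {
	repo, _ := MLXModelFor("qwen3:8b")
	fmt.Println(repo)
}
```

Falling back to Ollama on a missing mapping keeps the experimental backend from breaking runs that reference models with no MLX conversion.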