Description
What It Does
mlx-lm is Apple's official Python library for running and fine-tuning LLMs directly on Apple Silicon via the MLX framework. It runs on the GPU through Metal and exploits Apple Silicon's unified memory architecture, so tensors move between CPU and GPU without copies. Features include native 4-bit quantization (roughly 3-4x RAM reduction versus fp16), a multi-turn KV cache with context reuse, speculative decoding, LoRA fine-tuning, and streaming inference, and it supports thousands of Hugging Face models. A recent benchmark (vllm-mlx, 2026) reports up to 87% higher throughput than llama.cpp on the same Mac hardware. Install with `pip install mlx-lm`, then run e.g. `mlx_lm.generate --model mlx-community/Qwen3-8B-4bit`.
Why It Matters for ShellForge
ShellForge currently ties inference exclusively to Ollama. On M-series Macs, mlx-lm could double or triple token throughput for the same model — directly increasing how many agents ShellForge can run in parallel on a given RAM budget. The 4-bit quantization reduces Qwen3-30B from ~19GB to ~8-10GB, potentially fitting a 30B model on an M4 32GB machine (currently borderline). A --backend mlx flag in shellforge run and shellforge serve would let power users opt into the MLX inference path, while keeping Ollama as the default for simplicity.
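The proposed `--backend mlx` flag could resolve like the sketch below: default to Ollama, and reject MLX on non-macOS hosts. All names (`Backend`, `SelectBackend`, the flag values) are hypothetical, since ShellForge's actual CLI surface is not defined in this issue.

```go
package main

import (
	"fmt"
	"runtime"
)

// Backend identifies which inference engine serves a run.
// These values are hypothetical; ShellForge's real flag values may differ.
type Backend string

const (
	BackendOllama Backend = "ollama" // default, cross-platform
	BackendMLX    Backend = "mlx"    // opt-in, Apple Silicon only
)

// SelectBackend resolves the --backend flag value, falling back to
// Ollama when the flag is unset and rejecting MLX on non-macOS hosts.
func SelectBackend(flag string, goos string) (Backend, error) {
	switch flag {
	case "", "ollama":
		return BackendOllama, nil
	case "mlx":
		if goos != "darwin" {
			return "", fmt.Errorf("mlx backend requires macOS, got %s", goos)
		}
		return BackendMLX, nil
	default:
		return "", fmt.Errorf("unknown backend %q", flag)
	}
}

func main() {
	b, err := SelectBackend("", runtime.GOOS)
	fmt.Println(b, err)
}
```

Keeping Ollama as the zero-flag default preserves current behavior for existing users while making the MLX path a deliberate opt-in.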
Integration point: New internal/inference/mlx driver implementing the same InferenceBackend interface as the Ollama driver. Wire into shellforge status for backend health reporting.
Links
- GitHub: https://github.com/ml-explore/mlx-lm
- Stars: 4,236 (created 2025-03-11, actively maintained by Apple)
- License: MIT ✅
- Language: Python
Integration Effort
Heavy — requires spawning mlx_lm as a subprocess or HTTP server alongside Ollama, mapping model names between ecosystems, and handling macOS-only availability gracefully. Suggest shipping as an opt-in experimental backend.
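The model-name mapping could start as a small curated table, as in this sketch. The Ollama tags and mlx-community repo names below are illustrative guesses, not a verified mapping; a real integration would need a maintained table and a fallback path.

```go
package main

import (
	"fmt"
	"strings"
)

// mlxModels maps Ollama-style model tags to plausible mlx-community
// repos on Hugging Face. Entries are illustrative, not canonical.
var mlxModels = map[string]string{
	"qwen3:8b":  "mlx-community/Qwen3-8B-4bit",
	"qwen3:30b": "mlx-community/Qwen3-30B-A3B-4bit",
}

// MLXModelFor returns the MLX repo for an Ollama tag, or an error so
// callers can fall back to the Ollama backend when no mapping exists.
func MLXModelFor(ollamaTag string) (string, error) {
	if repo, ok := mlxModels[strings.ToLower(ollamaTag)]; ok {
		return repo, nil
	}
	return "", fmt.Errorf("no MLX equivalent known for %q", ollamaTag)
}

func main() {
	repo, _ := MLXModelFor("qwen3:8b")
	fmt.Println(repo)
}
```

Falling back to Ollama on a missing mapping keeps the experimental backend from breaking runs that reference models with no MLX conversion.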