Tool Submission: Rapid-MLX
Name: Rapid-MLX
URL: https://github.com/raullenchai/Rapid-MLX
Category: Developer Tools / AI / Local LLM Inference
License: Apache-2.0
Language: Python
What is Rapid-MLX?
An OpenAI-compatible local LLM inference server built specifically for Apple Silicon. It delivers 2–4x faster token generation than Ollama by running directly on MLX (Apple's ML framework) with a highly optimized streaming pipeline.
Key Features
- OpenAI-compatible API — drop-in replacement, works with any OpenAI SDK client
- 2–4x faster than Ollama on Apple Silicon (M1/M2/M3/M4)
- Tool calling — full function/tool calling support for agentic workflows
- Reasoning models — streaming `<think>` token support (Qwen3, DeepSeek-R1, etc.)
- Vision & Audio — multimodal model support
- Structured output — JSON schema enforcement
- Prompt caching — persistent KV cache across requests for faster multi-turn chats
- Speculative decoding (MTP) — 1.4x additional decode speedup on supported models
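Because the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. The sketch below builds a standard `/v1/chat/completions` request body; the port (`8080`) and the model id (`qwen3-8b`) are assumptions for illustration — check the project's README for the actual defaults.

```python
import json

def build_chat_request(model: str, messages: list, stream: bool = True) -> bytes:
    """Serialize an OpenAI-style /v1/chat/completions request body."""
    return json.dumps(
        {"model": model, "messages": messages, "stream": stream}
    ).encode("utf-8")

body = build_chat_request(
    "qwen3-8b",  # hypothetical model id; substitute whatever you have loaded
    [{"role": "user", "content": "Hello from a local MLX server!"}],
)

# With Rapid-MLX running, this body would be POSTed to
# http://localhost:8080/v1/chat/completions — or, more simply, any
# OpenAI SDK client can be pointed at that base URL with a dummy API key.
print(json.loads(body)["model"])
```

The same drop-in pattern is what makes the tool-calling and structured-output features usable from existing agent frameworks without code changes.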
Install
# Homebrew (macOS)
brew install raullenchai/rapid-mlx/rapid-mlx
# pip
pip install rapid-mlx
Why it's relevant to developers
Local LLM inference on Mac has historically been bottlenecked by Ollama's overhead. Rapid-MLX bypasses that by integrating directly with Apple's MLX framework, giving developers a fully OpenAI-compatible server that runs substantially faster — making local AI development and testing much more practical on MacBooks and Mac Studios.
Benchmark (Qwen3.5-9B, M3 Ultra)
| Engine | Tokens/sec |
|---|---|
| Rapid-MLX | ~95 tok/s |
| mlx-lm | ~90 tok/s |
| Ollama | ~23 tok/s |
Rapid-MLX is roughly 4.1x faster than Ollama on this configuration (~95 vs ~23 tok/s).
Happy to provide any additional info or assets needed.