High-performance, macOS-native LLM inference engine for Apple Silicon.
MLXR is a local LLM runner built specifically for Apple Silicon (M4, M3, M2) that combines:
- MLX framework for tensor/graph management
- Custom Metal kernels for performance-critical operations
- OpenAI and Ollama-compatible APIs for seamless integration
- React-based GUI with real-time streaming
- Native Performance: Custom Metal kernels optimized for Apple's unified memory architecture
- Memory Efficient: Paged KV cache with smart eviction policies
- High Throughput: Continuous batching and speculative decoding
- Model Support: GGUF, HF safetensors, and native MLX formats
- Quantization: Full support for Q2_K through Q8_K, FP8, and NF4
- Developer Friendly: OpenAI and Ollama-compatible REST APIs (see the client sketch below)
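Because the daemon speaks the OpenAI wire protocol, any stock OpenAI client should work against it. A minimal sketch using the official `openai` Python package; the base URL, port, and model name are illustrative assumptions, not project defaults:

```python
# Point a standard OpenAI client at the local daemon.
# Base URL, port, and model name are placeholders -- check your
# server config (configs/server.yaml) for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed daemon address
    api_key="not-needed-locally",         # local daemon; key likely unused
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",          # use a model the registry reports
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```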
┌─────────────────────────┐
│ React WebView GUI │
│ (Tray/Dock App) │
└───────────┬─────────────┘
│ Unix Domain Socket
┌───────────▼─────────────┐
│ Daemon (REST/gRPC) │
│ - OpenAI API │
│ - Ollama API │
│ - Model Registry │
└───────────┬─────────────┘
│
┌───────────▼─────────────┐
│ Inference Core │
│ - MLX Graph │
│ - Metal Kernels │
│ - Paged KV Cache │
│ - Continuous Batching │
└─────────────────────────┘
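The GUI reaches the daemon over the Unix domain socket shown above. For illustration, a stdlib-only sketch of issuing an HTTP request across a UDS, mirroring what the JavaScript bridge does; the socket path and endpoint are assumptions, not the daemon's documented defaults:

```python
# Talk to the daemon over its Unix domain socket with plain HTTP.
# SOCKET_PATH is hypothetical -- check the daemon configuration.
import socket

SOCKET_PATH = "/tmp/mlxr.sock"  # assumed location

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(SOCKET_PATH)
    s.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:          # server closed the connection
            break
        chunks.append(data)

print(b"".join(chunks).decode())  # raw HTTP response: status line + JSON body
```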
- First Token: < 1s for 7B-8B models at 4-bit
- Decode: < 80ms/token steady-state
- Embeddings: < 20ms/sample
- Occupancy: ≥ 60% GPU utilization on attention kernels
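These targets can be sanity-checked from the client side with a streaming request: time the first streamed chunk for first-token latency, and the gaps between later chunks for decode speed. A sketch using `requests`; the address and model name are assumptions:

```python
# Client-side check of the latency targets above: first streamed token
# (TTFT, including request overhead) and mean inter-token gap.
import time

import requests

url = "http://localhost:8080/v1/chat/completions"  # assumed daemon address
payload = {
    "model": "llama-3-8b-instruct",                # assumed model name
    "messages": [{"role": "user", "content": "Count from 1 to 50."}],
    "stream": True,
}

start = time.perf_counter()
stamps = []
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            stamps.append(time.perf_counter())     # one SSE event per token

if stamps:
    print(f"first token: {(stamps[0] - start) * 1000:.0f} ms")
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    if gaps:
        print(f"decode: {sum(gaps) / len(gaps) * 1000:.1f} ms/token")
```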
- macOS 14.0 (Sonoma) or later
- Apple Silicon (M2, M3, or M4)
- Xcode 15+ (for building)
- Homebrew package manager
- CMake 3.20+
- Python 3.11+ (with MLX installed)
- Node.js 18+ (for frontend development)
The following Homebrew packages are required:
# Install all dependencies at once
brew install cmake ninja mlx sentencepiece nlohmann-json cpp-httplib googletest
# Or use the Makefile convenience target
make install-deps

Note: CMake and Ninja must be installed via Homebrew, not Conda; a Conda-provided cmake or ninja earlier on your PATH can shadow the Homebrew toolchain and break the build.
✅ Core Infrastructure Complete - Integration Work Remaining
Codebase Size: ~50,000 LOC across core, daemon, app, tests, and SDKs
Phase 1: Minimal Inference ✅ 100%
- Complete Llama model with safetensors loading (737 lines)
- SentencePiece tokenizer (252 lines)
- Sampling strategies (greedy, temperature, top-k, top-p) - 534 lines (see the top-p sketch after this list)
- Working text generation pipeline
- Example: simple_generation.cpp - WORKS ✅
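For reference, a minimal NumPy sketch of nucleus (top-p) sampling, the same strategy the C++ sampler implements; this is an independent illustration, not the project's code:

```python
# Reference implementation of top-p (nucleus) sampling with temperature.
import numpy as np

rng = np.random.default_rng()

def sample_top_p(logits: np.ndarray, p: float = 0.9, temperature: float = 0.8) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                     # smallest prefix with mass >= p
    weights = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=weights))

print(sample_top_p(rng.normal(size=32000)))
```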
Phase 2: Optimization ✅ 95%
- KV Cache System - Complete paged arena (2,373 lines) with LRU eviction and GQA support (see the sketch after this list)
- Scheduler - Continuous batching (439 lines) with prefill/decode separation
- Metal Kernels - All 6 kernels implemented (~5,200 LOC total):
- RMSNorm: 217 lines shader + 362 lines primitive - INTEGRATED & TESTED (81/81 tests) ✅
- Attention Decode: 295 + 574 lines - Ready for integration
- Attention Prefill: 370 + 633 lines - Ready for integration
- RoPE: 434 + 478 lines - Ready for integration
- SwiGLU MLP: 432 + 321 lines - Ready for integration
- Q-Gemm Dequant: 486 + 525 lines - Ready for integration
- Test Daemon - Working HTTP server (test_daemon) with health/models endpoints

⚠️ Integration Gap: Metal kernels need wiring in CachedAttention (8-16 hours)
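To make the KV-cache design above concrete, a toy Python sketch of the paged-arena idea: fixed-size pages handed out per sequence, with the least-recently-used sequence evicted under memory pressure. Page size, page-aligned appends, and the class shape are illustrative simplifications, not the engine's actual design:

```python
# Toy paged KV arena with LRU eviction (see core/ for the real one).
from collections import OrderedDict

PAGE_TOKENS = 16  # tokens per page (illustrative)

class PagedKVArena:
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))    # indices of unused pages
        self.tables = OrderedDict()           # seq_id -> page list, LRU order

    def extend(self, seq_id: int, n_tokens: int) -> list[int]:
        """Reserve pages for n new tokens (simplified: appends are page-aligned)."""
        pages = self.tables.setdefault(seq_id, [])
        self.tables.move_to_end(seq_id)       # this sequence is now most recent
        needed = -(-n_tokens // PAGE_TOKENS)  # ceiling division
        while len(self.free) < needed:
            self._evict_lru(protect=seq_id)
        pages.extend(self.free.pop() for _ in range(needed))
        return pages

    def _evict_lru(self, protect: int) -> None:
        victim = next((s for s in self.tables if s != protect), None)  # oldest first
        if victim is None:
            raise MemoryError("arena exhausted by a single sequence")
        self.free.extend(self.tables.pop(victim))  # reclaim the victim's pages
```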
Phase 3: Service Layer ✅ 70%
- REST API - OpenAI & Ollama-compatible endpoints (1,758 lines; see the Ollama sketch after this list)
- gRPC Server - FULLY IMPLEMENTED (1,101 lines) with streaming
- SSE Streaming - Real-time token generation (621 lines)
- Model Registry - SQLite catalog (1,137 lines) with GGUF parser (891 lines)
- Telemetry - Metrics collection (769 lines, 15/15 tests passing)
- Test Suite - 14 C++ unit test files, 299 total tests
⚠️ Integration Gap: Model loading → Engine → Worker wiring (4-8 hours)
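The Ollama-compatible surface can be exercised the same way as the OpenAI one; Ollama's /api/generate streams newline-delimited JSON chunks. A sketch assuming the standard Ollama route, with the address and model name as placeholders:

```python
# Stream a completion from the Ollama-compatible endpoint (NDJSON chunks).
import json

import requests

resp = requests.post(
    "http://localhost:8080/api/generate",            # assumed daemon address
    json={"model": "llama-3-8b-instruct", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):                             # final chunk carries stats
        break
```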
Frontend ✅ 90%
- React UI - 78 components fully implemented
- Chat Interface - With streaming and tool calls
- Model Management - Pull, import, convert, quantize
- Metrics Dashboard - Real-time performance visualization
- All Pages - Chat, Models, Playgrounds, Metrics, Settings, Logs
macOS App ✅ 90%
- Swift Components - 20 files implementing app host
- JavaScript Bridge - Complete UDS communication
- Xcode Project - Exists and configured
- Daemon Management - launchd integration
⚠️ Missing: .app bundle build, code signing, .dmg creation
SDKs ✅ 95%
- Python SDK - Complete client with async support
- TypeScript SDK - Full type definitions and clients
- Swift SDK - SwiftPM package with examples
- Prefill: 198-459 ms (5-10 tokens)
- Decode: 53-220 ms/token
- Throughput: 4.5-18.9 tokens/sec
- Memory: 87.5% reduction with GQA (308 MB saved; derivation sketched below)
- Expected after Metal integration: 2-5x improvement
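The GQA figure follows directly from the head counts: the KV cache scales with KV heads rather than query heads, so grouping 8 query heads per KV head cuts it by 7/8. The head counts below are typical Llama-style values assumed for illustration:

```python
# Where the 87.5% comes from: KV cache size is proportional to KV heads.
n_q_heads, n_kv_heads = 32, 4                 # 8 query heads share each KV head
reduction = 1 - n_kv_heads / n_q_heads
print(f"KV cache reduction with GQA: {reduction:.1%}")  # -> 87.5%
```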
P0 - Get it Working (14-28 hours):
1. ⚠️ Metal Kernel Integration (8-16h) - Wire attention kernels in CachedAttention
2. ⚠️ Daemon Model Loading (4-8h) - Complete load_model() in the REST server
3. Server Config (2-4h) - Create configs/server.yaml
P1 - Make it Fast (20-32 hours):
4. Quantization (8-12h) - GGUF loading + Q-gemm integration
5. RoPE/SwiGLU Kernels (6-10h) - Wire remaining kernels
6. Speculative Decoding (6-10h) - Connect draft model
P2 - Ship It (36-60 hours):
7. App Bundle (8-16h) - .dmg, code signing, auto-update
8. CPU Fallbacks (16-24h) - Neon SIMD implementations
9. Conversion Tools (12-20h) - GGUF→MLX, quantizers
See CLAUDE.md for comprehensive implementation roadmap and accurate metrics.
MLXR/
app/ # macOS app bundle & React UI
daemon/ # Background server (REST/gRPC)
core/ # Inference engine (C++/MLX/Metal)
tools/ # Model converters and utilities
sdks/ # Client libraries (Python, TS, Swift)
configs/ # Configuration files
scripts/ # Build and development scripts
tests/ # Test suites
plan/ # Architecture specifications
# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Clone the repository
git clone <repository-url>
cd MLXR
# Install system dependencies
make install-deps
# Setup Python environment (recommended)
make setup
conda activate mlxr
# Check installation
make status

# Full build (Metal shaders + C++ core)
make build
# Or quick development build (Metal only)
make dev
# Run tests
make test-cpp

# Run daemon
./build/cmake/bin/mlxrunnerd
# Develop frontend (separate terminal)
cd app/ui
yarn install
yarn dev

For detailed build instructions, see CLAUDE.md.
- Repository structure and build system
- Metal shader compilation pipeline
- Toolchain validation (Homebrew, CMake, Ninja)
- MLX integration and model loading (safetensors)
- SentencePiece tokenizer
- Single-request inference with FP16
- Working examples in examples/
- Paged KV cache with eviction policies
- Metal kernel implementations (RMSNorm tested, others implemented)
- Continuous batching scheduler
- GQA support for memory efficiency
- Next: CachedLlamaModel integration with Engine
- REST API daemon (OpenAI & Ollama compatible)
- Model registry with SQLite backend
- SSE streaming for real-time generation
- Telemetry and metrics
- Next: Full API endpoint integration
- macOS app bundle with React WebView
- Unix domain socket communication
- Auto-updates via Sparkle
- Code signing and notarization
See plan/SPEC01.md for complete roadmap and docs/IMPLEMENTATION_STATUS.md for current status.
Developer Guides:
- CLAUDE.md - Comprehensive development guide and coding standards
- docs/IMPLEMENTATION_STATUS.md - Current implementation status and metrics
- docs/SECURITY_FIXES.md - Security vulnerability tracking and best practices
Architecture & Planning:
- plan/SPEC01.md - Complete specification and requirements
- plan/Structure.md - Architecture and component overview
- plan/MetalKernelsPlan.md - Metal kernel specifications
- plan/PackagingDistro.md - Distribution strategy
- plan/FrontendPlan.md - React UI implementation plan
Implementation Details:
- docs/PHASE2_COMPLETION.md - Scheduler-engine integration details
- docs/DAEMON_STATUS.md - Daemon components status
- docs/KV_CACHE_IMPLEMENTATION.md - Paged KV cache architecture
- docs/GQA_RESHAPE_FIX.md - Critical GQA support fix
- app/ui/COMPONENTS.md - React UI components documentation
This project is actively developed and welcomes contributions!
Current Focus Areas:
- CachedLlamaModel integration with Engine
- Metal kernel optimization and testing
- OpenAI API endpoint completion
- Performance benchmarking and profiling
Before Contributing:
- Read CLAUDE.md for development guidelines
- Check docs/IMPLEMENTATION_STATUS.md for current status
- Review docs/SECURITY_FIXES.md for security best practices
Development Standards:
- ✅ All C++ code passes unit tests
- ✅ Security: No system() calls, proper input validation, ReDoS-safe regex
- ✅ Cross-platform: Use std::filesystem for paths
- ✅ Documentation: Update docs for significant changes
Apache License 2.0 - See LICENSE for details.
Built with:
- MLX - Apple's machine learning framework
- Metal - Apple's GPU compute API
- Inspired by vLLM, llama.cpp, and Ollama
Status: Active Development (Phase 2 Complete, Phase 3 In Progress)
Target: Q1 2025 MVP release
Latest: See docs/IMPLEMENTATION_STATUS.md for current metrics and progress