FountainKit Local Agent Runtime provides a turnkey environment for running open‑source large language models (LLMs) on macOS and iOS devices and integrating them into the FountainKit ecosystem. Unlike one‑off demo apps, this project is designed for teams who want to own the entire local‑inference stack, including the Swift wrappers around the inference engine. The runtime loads quantized GGUF models from Hugging Face, exposes a simple chat API compatible with OpenAI’s function‑calling protocol, and registers itself as a persona in the FountainKit gateway. This allows you to build intelligent agents that run offline, protect user privacy and avoid cloud inference costs while maintaining full control over the code.
- Offline and private – Runs LLM inference entirely on the user’s device. No data is sent to third‑party servers. Works on Apple Silicon (M1/M2/M3) Macs and iPhones/iPads.
- Function‑calling capable – Supports models tuned for JSON‑structured function calls (e.g. Gorilla OpenFunctions v2, Hermes 2 Pro, FireFunction v1). Accepts a list of function definitions and returns a function call or text response.
- Hugging Face interoperability – Easily download and manage models from Hugging Face. Supports GGUF quantization for efficient inference via `llama.cpp`. Model size can be reduced via Q4/Q5 quantization.
- Native Apple tooling and custom wrappers – Offers multiple backends: compile `llama.cpp` yourself and wrap it in Swift for full control, use Core ML for hardware‑accelerated inference, or leverage MLX for GPU‑accelerated models. You’re not tied to third‑party wrappers—this repository includes examples and guidance for building your own, with Kuzco or LocalLLMClient serving only as references if you need inspiration.
- Seamless FountainKit integration – Provides configuration templates and persona definitions so your local agent can plug into the FountainKit gateway, planner and function‑caller services.
- Extensible service layer – Built on SwiftNIO, the chat service can be extended to support streaming responses, authentication, custom logging or additional endpoints.
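Because the service layer is plain SwiftNIO, extending it usually means adding a handler to the channel pipeline. Below is a minimal sketch of a custom request‑logging handler; the handler and its placement are illustrative, not part of the package’s current API.

```swift
import NIOCore
import NIOHTTP1

// Logs the method and URI of each inbound HTTP request head, then
// passes the part along unchanged to the next handler in the pipeline.
final class RequestLoggingHandler: ChannelInboundHandler {
    typealias InboundIn = HTTPServerRequestPart

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        if case .head(let head) = unwrapInboundIn(data) {
            print("\(head.method) \(head.uri)")
        }
        context.fireChannelRead(data)
    }
}

// Install it ahead of the chat handler when configuring the channel:
// try channel.pipeline.syncOperations.addHandler(RequestLoggingHandler())
```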
- Apple hardware – A Mac with an M1 or later processor is recommended. iOS and Mac Catalyst apps require iOS 15+/macOS 12+.
- Xcode – Install the latest Xcode to build Swift packages and run the sample app.
- Model weights – Download one or more function‑calling models from Hugging Face. Good starting points are the models listed above (e.g. Gorilla OpenFunctions v2, Hermes 2 Pro or FireFunction v1).
Ensure that the model licences permit your intended use (many are Apache 2.0). For FireFunction v1, use the Ollama “firefunction” model tag to download a quantized version.
The repository contains two main components.

`AgentService` is a Swift package that hosts a local LLM and exposes a `/chat` endpoint that adheres to the OpenAI chat completion and function‑calling schema. The package demonstrates how to bridge a C inference engine into Swift using a `module.modulemap` and how to implement an async streaming API in pure Swift. It intentionally avoids depending on third‑party Swift wrappers, so you own the wrapper layer and can adapt it to your needs.
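The core idea behind the streaming side is to wrap the engine’s pull‑based decode loop in an `AsyncThrowingStream`. The sketch below is illustrative rather than the package’s actual API: `TokenGenerator`, `start(prompt:)` and `nextToken()` are hypothetical stand‑ins for whatever your C shim (exposed via the `module.modulemap`) provides.

```swift
import Foundation

/// Hypothetical abstraction over a compiled llama.cpp decode loop.
/// `nextToken()` returns nil once the model emits an end-of-sequence token.
protocol TokenGenerator: Sendable {
    func start(prompt: String) throws
    func nextToken() throws -> String?
}

struct LocalLLM {
    let generator: TokenGenerator

    /// Expose the blocking, pull-based C loop as an async token stream,
    /// so callers can `for try await` tokens as they are produced.
    func complete(prompt: String) -> AsyncThrowingStream<String, Error> {
        let generator = self.generator
        return AsyncThrowingStream { continuation in
            Task.detached {
                do {
                    try generator.start(prompt: prompt)
                    while let token = try generator.nextToken() {
                        continuation.yield(token)
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}
```

A `POST /chat/stream` handler can then iterate this stream and flush each token as a Server‑Sent Event.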
- Add `AgentService` to your project via Swift Package Manager.
- Configure the service with the path to your GGUF or Core ML model in `agent-config.json`. When using a GGUF model, the service links directly against your compiled `llama.cpp` library. When using a Core ML model, it loads the `.mlmodel` via `MLModel(contentsOf:)`.
- Run the service using `swift run AgentService` and verify that `http://localhost:8080/chat` returns responses (see the example after this list).
- Streaming is supported at `POST /chat/stream` (Server‑Sent Events).
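As a quick check from Swift rather than curl, a request along these lines should work, assuming the service accepts OpenAI‑style chat bodies as described above. The `model` value here is a hypothetical identifier; your build may key the model off `agent-config.json` instead.

```swift
import Foundation

// Minimal OpenAI-style chat request.
struct ChatMessage: Codable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
}

func sendChat(_ text: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:8080/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        ChatRequest(model: "local", messages: [ChatMessage(role: "user", content: text)])
    )
    let (data, _) = try await URLSession.shared.data(for: request)
    return String(decoding: data, as: UTF8.self)  // raw JSON; decode as needed
}
```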
The second component contains a FountainKit persona and configuration templates. It demonstrates how to register the local agent as a persona and route requests through FountainKit’s gateway.
- Copy `LocalAgentPersona.md` into your FountainKit repository’s `personas/` directory.
- Update `gateway.yaml` to include the `LocalAgentPersona` plugin in the plugin chain.
- Start FountainKit’s gateway, planner, function‑caller and persistence services.
- Send a chat request through the gateway; the planner will delegate the function‑calling request to your local agent.
Most function‑calling models require several gigabytes of memory in their native form. To run them comfortably on consumer hardware, quantize them:
- Download – Use `huggingface-cli` or `git lfs` to pull the model repository. Alternatively, for FireFunction models, use `ollama pull joefamous/firefunction-v1:q4_0`.
- Quantize – If the repository provides pre‑quantized GGUF files, download them directly. Otherwise, convert the model using `llama.cpp`’s conversion script (`convert_hf_to_gguf.py` in current releases) and the `llama-quantize` tool (or the equivalent tools in other inference engines).
- Store – Place the GGUF file in `Models/` and update your configuration.
Quantization levels (`q2`, `q3`, `q4`, `q5`, `q6`) represent a trade‑off between memory usage and accuracy. Start with `q4` and adjust based on performance.
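As a rough way to reason about that trade‑off: GGUF file size is approximately parameter count × effective bits per weight ÷ 8. The bits‑per‑weight figures below are ballpark values, not exact for any particular quant scheme.

```swift
// Back-of-the-envelope model size: params × bits/weight ÷ 8.
// With parameters in billions, this yields gigabytes directly.
// Effective bits per weight are approximate (q4-style quants land
// around 4.5 bits once scales and metadata are included).
func approxSizeGB(parametersBillions: Double, bitsPerWeight: Double) -> Double {
    parametersBillions * bitsPerWeight / 8.0
}

let sevenB_q4 = approxSizeGB(parametersBillions: 7, bitsPerWeight: 4.5) // ≈ 3.9 GB
let sevenB_q8 = approxSizeGB(parametersBillions: 7, bitsPerWeight: 8.5) // ≈ 7.4 GB
```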
The runtime supports multiple inference backends:
- `llama.cpp` via custom Swift wrappers – Compile `llama.cpp` yourself and create a Swift wrapper that exposes its C API. This gives you full control over the code and eliminates reliance on third‑party packages. Projects like Kuzco demonstrate how to wrap `llama.cpp` in Swift; use them as references rather than dependencies.
- Core ML – Convert a PyTorch model to Core ML using the coremltools Python library. Apple’s “On Device Llama 3.1 with Core ML” guide shows how to export Llama 3.1‑8B‑Instruct and achieve ~33 tokens/s on an M1 Max Mac. After conversion, load the `.mlmodel` in Swift and call `prediction(from:)` to get outputs (see the sketch below).
- MLX – Apple’s MLX framework offers GPU‑accelerated inference. The LocalLLMClient package includes an MLX backend. Use it for models that have been ported to MLX (e.g. Gemma, Qwen 2.5) and when you require higher throughput than `llama.cpp`.
Choose the backend that best fits your hardware and model.
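For the Core ML path, loading a converted model and running one prediction is only a few lines. In this sketch the feature name `inputIds` is a placeholder; inspect `model.modelDescription` for the names your conversion actually produced. Note that `MLModel(contentsOf:)` expects a compiled `.mlmodelc` URL; use `MLModel.compileModel(at:)` first if you only have a `.mlmodel`.

```swift
import CoreML

// Load a converted model and run a single prediction over token IDs.
func runPrediction(modelURL: URL, tokenIds: [Int32]) throws -> MLFeatureProvider {
    let model = try MLModel(contentsOf: modelURL)

    // Pack the token IDs into a [1, n] multi-array.
    let array = try MLMultiArray(
        shape: [1, NSNumber(value: tokenIds.count)],
        dataType: .int32
    )
    for (i, id) in tokenIds.enumerated() {
        array[[0, NSNumber(value: i)]] = NSNumber(value: id)
    }

    // "inputIds" is a placeholder feature name; check
    // model.modelDescription.inputDescriptionsByName for the real ones.
    let input = try MLDictionaryFeatureProvider(dictionary: ["inputIds": array])
    return try model.prediction(from: input)
}
```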
FountainKit orchestrates requests through its gateway, planner, function‑caller and persistence services. To integrate your local runtime:
- Implement a persona – Create a persona definition in Markdown (see `LocalAgentPersona.md`) that instructs the gateway to forward certain requests (e.g. those requiring local inference) to your agent’s `/chat` endpoint.
- Add the persona to the gateway – Edit `gateway.yaml` to include your persona plugin in the plugin chain. You may also register the persona for specific triggers (e.g. project type or user role).
- Register functions – Use the function‑caller service to register the functions your model can call. For example, if your application can set timers, schedule meetings or fetch data, describe each function with a name, description and JSON schema (see the sketch after this list).
- Test – Run the FountainKit services and your local agent. Send a high‑level objective (e.g. “schedule a meeting with my team next Tuesday at 3 PM”). The planner will ask the agent to produce a function call, then the function‑caller will execute it, and the gateway will return the result.
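In the OpenAI function‑calling format these models are tuned for, each function pairs a name and description with a JSON Schema for its parameters. Below is a hedged sketch of one such definition; the registration endpoint and envelope depend on your FountainKit deployment, so only the payload is shown, and the schema types are modeled minimally.

```swift
import Foundation

// One function definition in the OpenAI function-calling shape:
// name, description, and a JSON Schema describing the parameters.
struct ParameterSchema: Codable {
    let type: String
    var format: String? = nil
    var items: [String: String]? = nil
}

struct FunctionParameters: Codable {
    var type = "object"
    let properties: [String: ParameterSchema]
    let required: [String]
}

struct FunctionDefinition: Codable {
    let name: String
    let description: String
    let parameters: FunctionParameters
}

let scheduleMeeting = FunctionDefinition(
    name: "schedule_meeting",
    description: "Schedule a meeting with the given attendees.",
    parameters: FunctionParameters(
        properties: [
            "attendees": ParameterSchema(type: "array", items: ["type": "string"]),
            "start_time": ParameterSchema(type: "string", format: "date-time"),
            "duration_minutes": ParameterSchema(type: "integer"),
        ],
        required: ["attendees", "start_time"]
    )
)

// Serialize for registration with the function-caller service.
let payload = try JSONEncoder().encode(scheduleMeeting)
```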
- Service security – Only expose the `/chat` API locally or behind authentication when deploying to production. Validate and sanitize all user inputs.
- Resource management – Running large models on a laptop will consume CPU/GPU and memory. Monitor system resources and choose appropriate quantization levels.
- Model updates – Keep track of new releases on Hugging Face; models are rapidly improving and may offer better function‑calling accuracy or smaller footprints.
Contributions are welcome! If you find a bug or want to add support for another backend, please open an issue or submit a pull request. When adding new models or examples, ensure that they comply with their respective licences and include attribution.
This project is released under the MIT licence. Third‑party models and libraries may have their own licences—please review them before use in production.
Prebuilt macOS arm64 binaries are published on tagged releases (`v*.*.*`).
- Artifact: `AgentService-macos-arm64.tar.gz` (contains the `AgentService` executable)
- Requirements when using llama backend:
  - `brew install llama.cpp` (ensures `libllama` is present at runtime)
  - Or build `llama.cpp` from source and make its `llama` library discoverable via `DYLD_LIBRARY_PATH` or install path
- Legal: This repository uses MIT‑licensed code. Model weights and `llama.cpp` carry their own licences; ensure compliance if redistributing.
- Static‑linked variant: `AgentService-macos-arm64-static.tar.gz`
  - Contains a binary statically linked against a pinned `llama.cpp` commit (default: `b6550`).
  - CI explicitly builds with `LLAMA_CPP_REF=b6550` for reproducibility.
  - To override locally, run builds with a different ref: `LLAMA_CPP_REF=<tag-or-commit> ./scripts/build-llama-static.sh`.
- Prebuilt xcframework: `LlamaCppCBinary.xcframework.tar.gz`
  - A SwiftPM‑consumable binary target for the `LlamaCppC` shim + static `libllama` (see the `Package.swift` sketch below).
  - CI builds with `LLAMA_CPP_REF=b6550`. Locally you can run `LLAMA_CPP_REF=<tag-or-commit> ./scripts/build-prebuilt-xcframework.sh`.
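To consume the prebuilt xcframework from SwiftPM, a binary target along these lines should work. This is a sketch: the target and product names are assumptions, and it references the `.xcframework` by local path after extracting the release tarball (remote `url:` binary targets require a zip archive, so a local path is simplest here).

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyAgentApp",
    targets: [
        // Extract LlamaCppCBinary.xcframework.tar.gz from the release into
        // Vendor/ and reference it by path. The target name is an assumption;
        // match it to the module name inside the xcframework.
        .binaryTarget(
            name: "LlamaCppC",
            path: "Vendor/LlamaCppCBinary.xcframework"
        ),
        .executableTarget(
            name: "MyAgentApp",
            dependencies: ["LlamaCppC"]
        ),
    ]
)
```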