A TUI agent harness for small LLMs running on your Mac
Quick Install · Getting Started · Features · Backends · Tools · Slash Commands · Configuration · Development
Small Harness is a terminal-based agent harness for running small open-weight LLMs locally on consumer Macs. It points the same TUI at five different inference backends: Ollama, LM Studio, MLX, llama.cpp, or OpenRouter cloud. The harness gives the model a focused set of filesystem and shell tools, and gates dangerous operations behind an approval prompt.
It is built for developers who want to use a 7B–14B model as an interactive coding assistant without depending on a cloud API. Hardware profiles for the Mac mini (16 GB) and Mac Studio (32 GB) pick sensible default models per backend so you can start running without picking weights out of a long list.
| Area | What you get |
|---|---|
| First-run setup | Interactive wizard writes agent.config.json, picks backend/profile/model, chooses approval/tool mode, and probes the backend |
| Local-first | OpenAI-compatible chat completions against Ollama, LM Studio, MLX, or llama.cpp, all selectable at runtime |
| Cloud comparison | One-key A/B against any OpenRouter model with /compare |
| Hardware profiles | mac-mini-16gb and mac-studio-32gb map to model defaults sized for the box |
| Configurable tools | File read/write/edit, apply-patch, glob, grep, list-dir, shell — pick which to enable to control prompt-eval cost |
| Approval gates | Per-tool prompts with diff previews, allow-once / allow-this-session / always-allow caching |
| Robust parsing | Inline JSON-shaped tool-call detector for small models whose templates skip the tool_calls field |
| Pre-warm at startup | Sends a 1-token request with the full system prompt + tools so the cache is hot before your first prompt |
| Efficiency mode | Auto-selects tool schemas per prompt, shows prompt-budget breakdowns, and compacts large tool outputs |
| Streaming output | Tokens stream as they arrive, with a grouped tool-call display |
| Session persistence | JSONL append-only session logs with list, resume, and export commands |
| Slash commands | /setup, /backend, /profile, /model, /tools, /compare, /session, /sessions, /resume, /export, /doctor, /bench, /eval, /new, /help |
| Bordered TUI | Clean terminal box input with persisted history, arrow recall, and Ctrl-J multi-line prompts |
You will need Rust (stable, 1.75+) and one local-inference backend running.
git clone https://github.com/morganlinton/SmallHarness.git
cd SmallHarness
cp .env.example .env
cargo run --release

Build a standalone binary with cargo build --release — it lands at
target/release/small-harness (~5 MB).
By default Small Harness talks to Ollama at http://localhost:11434/v1. To
target LM Studio, MLX, or llama.cpp instead, set BACKEND=lm-studio,
BACKEND=mlx, or BACKEND=llamacpp before running, or use /backend once
the harness is running.
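For a one-off run, the backend can also be set inline on the command line; for example, assuming LM Studio's local server is already running on its default port:

# inline override for a single run; same effect as setting BACKEND=lm-studio in .env
BACKEND=lm-studio cargo run --release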
If agent.config.json does not exist, the first run opens a short setup
wizard that writes one for you and probes the selected backend. Set
SMALL_HARNESS_NO_WIZARD=true to skip the wizard and use env/defaults only.
Pick one. Ollama is the fastest path on a fresh box:
brew install ollama
brew services start ollama
ollama pull qwen2.5-coder:7b

LM Studio (already installed), MLX, and llama.cpp are also supported. See Backends for ports and setup notes.

Then start the harness:

cargo run --release

On a fresh checkout, the setup wizard asks for backend, hardware profile,
optional model override, approval policy, and adaptive/fixed tool mode, then
writes agent.config.json. After setup, you will see the banner, a backend
probe, and a "Warming up" spinner that populates the prompt-eval cache so the
first prompt isn't slow. When the input box opens, type a question:
> what files are in src/?
/backend lm-studio switch to LM Studio
/backend llamacpp switch to llama.cpp
/setup rerun setup and rewrite agent.config.json
/profile mac-studio-32gb switch the hardware profile (changes default model)
/model list models from the current backend and pick one
/tools show enabled tools and auto/fixed selection mode
/compare run the same prompt against OpenRouter cloud
/sessions list saved JSONL sessions
/resume latest resume the newest saved session
/doctor check backend, config, rg, and session storage
/doctor --deep probe stream, usage, and tool-call capabilities
Each tool definition costs prompt-eval time on small local models. Small
Harness defaults to toolSelection: "auto", so ordinary chat sends no tool
schemas, file/code questions send read/search/list schemas, edit requests add
edit/patch schemas, and shell-ish prompts add shell when it is enabled.
The tools list is the allowed pool:
/tools auto adaptive tool selection (default)
/tools fixed always send every enabled tool schema
/tools file_read,grep,list_dir
/tools auto file_read,grep,list_dir
Or set persistently in agent.config.json:
{ "tools": ["file_read", "file_edit", "grep", "list_dir"] }| Backend | Default URL | API style | Best for |
|---|---|---|---|
ollama |
http://localhost:11434/v1 |
OpenAI-compatible | Easiest setup; mature tool-call templates; CLI model management |
lm-studio |
http://localhost:1234/v1 |
OpenAI-compatible | GUI model browser; explicit load/unload controls |
mlx |
http://localhost:8080/v1 |
OpenAI-compatible (via mlx_lm.server) |
Fastest inference on Apple Silicon |
llamacpp |
http://localhost:8080/v1 |
OpenAI-compatible (via llama-server) |
Direct GGUF serving; fastest path if you already use llama.cpp |
openrouter |
https://openrouter.ai/api/v1 |
OpenAI-compatible | Cloud A/B comparison; access to larger frontier models |
Override URLs with OLLAMA_BASE_URL, LM_STUDIO_BASE_URL, MLX_BASE_URL,
or LLAMACPP_BASE_URL. openrouter requires OPENROUTER_API_KEY.
llamacpp uses LLAMACPP_API_KEY only if your llama-server enforces one.
The official @openrouter/agent SDK speaks OpenRouter's newer /responses
endpoint. Small Harness uses a hand-rolled reqwest + SSE client pointed at
each backend's baseURL because /v1/chat/completions is the common shape
across the supported local servers and OpenRouter cloud, even when a backend
also exposes newer endpoints.
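That shared shape is the standard OpenAI-style chat-completions payload. As a rough sketch of what goes over the wire (shown as a hand-rolled curl against the default Ollama endpoint with its default model, not how the harness itself is invoked):

# -N turns off curl buffering so the SSE chunks print as they stream
curl -N -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": true
      }'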
| Tool | Default | Approval | What it does |
|---|---|---|---|
| `apply_patch` | off | yes | Validate and apply a unified diff with `git apply --check` |
| `file_read` | on | no* | Read a file (text or image base64) with optional offset/limit |
| `file_edit` | on | yes | Search-and-replace edits with unique-match validation, returns unified diff |
| `grep` | on | no | Regex search file contents (uses ripgrep) |
| `list_dir` | on | no* | List directory entries, alphabetical, capped at 500 |
| `file_write` | off | yes | Write/create a file (overwrites) |
| `glob` | off | no* | Find files by glob pattern |
| `shell` | off | yes | Run a shell command, output capped at 256 KB |
* Read-only tools prompt when `outsideWorkspace` is `prompt` and the request
targets a path outside `workspaceRoot`.
Toggle the active set per session with /tools, per shell with the
AGENT_TOOLS env var, or persistently in agent.config.json.
| Policy | Behavior |
|---|---|
| `always` (default) | Every call to a mutating tool prompts you |
| `dangerous-only` | Only shell calls matching `rm`, `sudo`, `chmod`, `dd`, `mkfs`, etc. prompt; safer commands run silently |
| `never` | No prompts (use only when you trust the model) |
At each prompt you can choose [y]es, [n]o, [a]lways for this tool, or
[s]ession-allow this exact call. The session cache resets on /new.
| Command | Description |
|---|---|
| `/help` | List available commands |
| `/setup` | Run the setup wizard, write `agent.config.json`, probe the backend, and apply the new config |
| `/new` | Start a fresh conversation |
| `/clear` | Clear the screen |
| `/config` | Show resolved backend, model, workspace, history, display, and context config |
| `/session` | Show backend, model, approval policy, session path, message count, total tokens |
| `/sessions` | List saved sessions under `.sessions/` |
| `/resume latest\|<id>` | Resume a saved session |
| `/export current\|<id> [markdown\|json] [path]` | Export a session transcript |
| `/backend [name]` | Switch backend (`ollama`, `lm-studio`, `mlx`, `llamacpp`, `openrouter`) |
| `/profile [name]` | Switch hardware profile (`mac-mini-16gb`, `mac-studio-32gb`) |
| `/model [id]` | List models from the current backend and pick one, or set directly |
| `/tools [auto\|fixed\|list]` | Show enabled tools, switch adaptive mode, or set the enabled pool: `/tools auto file_read,grep,list_dir` |
| `/compare [model]` | Re-send the last user message to OpenRouter cloud for A/B |
| `/context [maxMessages=N maxBytes=N]` | Show prompt budget, active adaptive tools, byte/token estimate, and context limits |
| `/compact [keep]` | Summarize older turns into a compact continuation session |
| `/doctor` | Check backend reachability, model list, `rg`, config, and session storage |
| `/doctor --deep [all]` | Probe OpenAI-compatible streaming, usage chunks, native tool calls, and inline JSON fallback, then save JSON/Markdown reports under `.sessions/doctor/` |
| `/bench [model]` | Measure warmup, first-token, total latency, and output rate |
| `/eval [prompt-file] [models]` | Run saved prompts against one or more models with tools off/on |
| `exit` | Quit |
/doctor --deep checks the active backend. Add all to probe every configured
backend with short timeouts; unreachable backends show as failed rows in the
capability table.
The profile drives the default model per backend. You can always override
with AGENT_MODEL or /model.
| Profile | Default Ollama model | Default LM Studio model | Default MLX model | Default llama.cpp model |
|---|---|---|---|---|
| `mac-mini-16gb` | `qwen2.5-coder:7b` | `qwen2.5-coder-7b-instruct` | `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` | `gpt-3.5-turbo` |
| `mac-studio-32gb` | `qwen2.5-coder:14b` | `qwen2.5-coder-14b-instruct` | `mlx-community/Qwen2.5-Coder-14B-Instruct-4bit` | `gpt-3.5-turbo` |
The OpenRouter cloud default for both profiles is
qwen/qwen-2.5-coder-32b-instruct. The llama.cpp default mirrors the
llama-server OpenAI-compatible examples; use /model or start
llama-server with --alias if you want the loaded GGUF to advertise a
specific model id.
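As one illustration (the GGUF path and alias are placeholders; the flags are standard llama-server options), a launch that advertises a specific model id could look like:

llama-server -m ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 --jinja \
  --alias qwen2.5-coder-7b-instruct

The alias is the id the server then reports from /v1/models, so /model can pick it by name rather than the generic default.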
llama.cpp and llama.cpp-derived engines cache the prompt-eval result for any
prefix they have already seen. At startup, Small Harness sends a tiny
chat-completions request with the full system prompt + tool definitions and
max_tokens: 1.
That populates the cache, so your first real prompt only has to evaluate
the new user tokens — typically dropping first-prompt latency from ~12 s to
~2 s on a 7B q4 model.
Disable with WARMUP=false if you want a faster startup at the cost of a
slow first prompt.
The cache becomes stale when you change /backend, /model, or /tools.
The next prompt after a switch will pay the prompt-eval cost again.
# Backend selection: ollama (default), lm-studio, mlx, llamacpp, openrouter
BACKEND=ollama
# Hardware profile: mac-mini-16gb (default) or mac-studio-32gb
PROFILE=mac-mini-16gb
# Override the model for the chosen backend
AGENT_MODEL=qwen2.5-coder:14b
# Per-backend endpoint overrides
OLLAMA_BASE_URL=http://localhost:11434/v1
LM_STUDIO_BASE_URL=http://localhost:1234/v1
MLX_BASE_URL=http://localhost:8080/v1
LLAMACPP_BASE_URL=http://localhost:8080/v1
# Optional if llama-server was started with API-key enforcement
LLAMACPP_API_KEY=sk-no-key-required
# Required when BACKEND=openrouter or you want /compare
OPENROUTER_API_KEY=sk-or-...
# Approval policy: always (default) | never | dangerous-only
APPROVAL_POLICY=always
# Active tools, comma-separated. Default: file_read,file_edit,grep,list_dir
AGENT_TOOLS=file_read,file_edit,grep,list_dir
# Tool schema selection: auto (default) or fixed
AGENT_TOOL_SELECTION=auto
# Pre-warm the model at startup (default: on)
WARMUP=true
# Skip first-run setup and rely on env vars / built-in defaults
SMALL_HARNESS_NO_WIZARD=false
# Maximum agent steps per turn
AGENT_MAX_STEPS=20
# Workspace safety: prompt (default), deny, allow
WORKSPACE_ROOT=/path/to/project
OUTSIDE_WORKSPACE=prompt
# Context/history tuning
AGENT_CONTEXT_MAX_MESSAGES=40
AGENT_CONTEXT_MAX_BYTES=262144
AGENT_HISTORY=true
AGENT_HISTORY_MAX_ENTRIES=200

For project-level defaults, run /setup or drop a JSON file in the repo root.
Anything you put here can be overridden by env vars or slash commands at
runtime.
{
"backend": "ollama",
"profile": "mac-mini-16gb",
"approvalPolicy": "dangerous-only",
"tools": ["file_read", "file_edit", "grep", "list_dir"],
"toolSelection": "auto",
"maxSteps": 20,
"workspaceRoot": "/path/to/project",
"outsideWorkspace": "prompt",
"context": {
"maxMessages": 40,
"maxBytes": 262144
},
"history": {
"enabled": true,
"maxEntries": 200
},
"profiles": {
"mac-studio-fast": {
"ollama": "qwen2.5-coder:14b",
"llamacpp": "gpt-3.5-turbo",
"openrouter": "qwen/qwen-2.5-coder-32b-instruct"
}
},
"display": {
"toolDisplay": "grouped",
"inputStyle": "bordered",
"loaderStyle": "spinner",
"loaderText": "Thinking",
"showBanner": true
}
}

Configuration is resolved in this order, highest precedence first:

- Slash command overrides at runtime
- Process environment variables (`BACKEND`, `PROFILE`, `AGENT_MODEL`, `AGENT_TOOLS`, …)
- `.env.local`, then `.env`
- `agent.config.json` in the working directory
- Built-in defaults
+-------------------------+
| main.rs |
| banner / input loop / |
| warmup / approval |
+------------+------------+
|
v
+--------------+ +-------------------------+ +-------------------+
| config.rs |--->| agent.rs |<-->| tools/*.rs |
| dotenv+JSON | | chat/completions loop | | serde-typed, |
| + profiles | | streaming + tool calls | | approval-gated |
+--------------+ +------------+------------+ +-------------------+
|
v
+-------------------------+
| backends.rs |
| Ollama / LM Studio / |
| MLX / llama.cpp / |
| OpenRouter |
+-------------------------+
|
v
+-------------------------+
| session.rs |
| JSONL sessions/export |
+-------------------------+
cargo check # type-check without producing a binary
cargo run # debug build + run (faster compile, slower runtime)
cargo run --release # optimized build + run
cargo build --release # produce target/release/small-harness

Project layout:
src/
main.rs entry — input loop, loader, approval wiring, warmup
agent.rs chat/completions runner with tool calls + streaming
backends.rs Ollama / LM Studio / MLX / llama.cpp / OpenRouter endpoints + defaults
config.rs dotenv + agent.config.json loader, workspace/context/history config
approval.rs y/n/always/session-allow prompt with diff previews
session.rs JSONL conversation log, listing, resume, export helpers
warmup.rs pre-warm the prompt-eval cache at startup
commands.rs slash commands for sessions, config, backends, evals, doctor, bench
renderer.rs grouped tool display
loader.rs spinner / gradient / minimal loaders
banner.rs ASCII banner + dynamic backend/profile/model line
input.rs bordered + plain readers with history and multi-line input
openai.rs wire types + SSE streaming for chat completions
tools/ apply_patch, file_read, file_write, file_edit, glob_tool, grep, list_dir, shell
Quality expectations:
- `cargo check` must pass cleanly.
- Tools that mutate filesystem state implement `require_approval` on the `Tool` trait (returning `true`, or computing it from the args for dangerous shapes — see `shell.rs`).
- New backends should expose an OpenAI-compatible `/v1/chat/completions` endpoint and add a profile-default model map in `backends.rs`.
Versioning:
- Small Harness stays on the `0.1.x` line before a larger product milestone.
- The patch number tracks the total repo commit count for the release commit. This setup release is `0.1.30`: 29 commits were already on `main`, and the release commit is expected to be commit 30.
- Release tags should use a leading `v`, for example `v0.1.30`.
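A quick sanity check of that convention (assuming a full, non-shallow clone; the tag command is illustrative, not a project script):

git rev-list --count HEAD                     # total commits up to the release commit; should equal the patch number
git tag "v0.1.$(git rev-list --count HEAD)"   # tag the release following the convention above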
The harness probes the backend at startup. If the probe fails, the named backend is not listening on the expected port. Suggestions:
- Ollama: `brew services start ollama`, or run `ollama serve` in a separate terminal. Default port 11434.
- LM Studio: open the app, go to "Local Server", click Start. Default port 1234.
- MLX: start `mlx_lm.server --port 8080` against an MLX-format model.
- llama.cpp: start `llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 8080`. Add `--jinja` when you want native OpenAI-style tool calls.
- OpenRouter: set `OPENROUTER_API_KEY` in `.env`.
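If you just need to confirm a local backend is listening at all, a direct request to its models endpoint is a quick check (shown for the default Ollama port; substitute the URL from the backends table):

curl -s http://localhost:11434/v1/models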
For backend-specific capability problems, run /doctor --deep. It exercises
/v1/models, streaming chat completions, usage chunks, a harmless tool-call
schema, and Small Harness' inline JSON fallback detector. Reports are saved to
.sessions/doctor/ for sharing or comparison.
If you change /backend, /model, /tools, or the hardware profile after
warmup, the cached prefix becomes stale and the next prompt re-evaluates
the new system prompt + tools. This is one-time per change.
Some small-model templates emit tool calls as plain content
(e.g. {"name":"shell","arguments":{...}}) instead of populating the
tool_calls field. Small Harness detects this pattern and synthesizes a
real tool call. If a particular model still misbehaves, switching to
llama3.1:8b (which has well-tested tool-call templates) usually resolves
it.
Some bilingual models (notably the qwen family) drift into Chinese on short
greetings. The system prompt now includes an explicit language directive,
but you can strengthen it further by editing SYSTEM_PROMPT in
src/config.rs.
Install Rust via rustup: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh.
Small Harness is released under the MIT License.
