A TUI agent harness for small LLMs running on your Mac
Quick Install · Getting Started · Features · Backends · Tools · Slash Commands · Configuration · Development
Small Harness is a terminal-based agent harness for running small open-weight LLMs locally on consumer Macs. It points the same TUI at five different inference backends: Ollama, LM Studio, MLX, llama.cpp, or OpenRouter cloud. The harness gives the model a focused set of filesystem and shell tools, and gates dangerous operations behind an approval prompt.
It is built for developers who want to use a 7B–14B model as an interactive coding assistant without depending on a cloud API. Hardware profiles for the Mac mini (16 GB) and Mac Studio (32 GB) pick sensible default models per backend so you can start running without picking weights out of a long list.
| Area | What you get |
|---|---|
| First-run setup | Interactive wizard writes agent.config.json, picks backend/profile/model, chooses approval/tool mode, and probes the backend |
| Local-first | OpenAI-compatible chat completions against Ollama, LM Studio, MLX, or llama.cpp, all selectable at runtime |
| Cloud comparison | One-key A/B against any OpenRouter model with /compare |
| Hardware profiles | mac-mini-16gb and mac-studio-32gb map to model defaults sized for the box |
| Configurable tools | File read/write/edit, apply-patch, glob, grep, list-dir, shell — pick which to enable to control prompt-eval cost |
| Approval gates | Per-tool prompts with diff previews, allow-once / allow-this-session / always-allow caching |
| Robust parsing | Inline JSON-shaped tool-call detector for small models whose templates skip the tool_calls field |
| Pre-warm at startup | Sends a 1-token request with the full system prompt + tools so the cache is hot before your first prompt |
| Efficiency mode | Auto-selects tool schemas per prompt, shows prompt-budget breakdowns, and compacts large tool outputs |
| Streaming output | Tokens stream as they arrive, with a grouped tool-call display |
| Session persistence | JSONL append-only session logs with list, resume, and export commands |
| Slash commands | /setup, /backend, /profile, /model, /tools, /compare, /session, /sessions, /resume, /export, /doctor, /bench, /eval, /new, /help |
| Bordered TUI | Clean terminal box input with persisted history, arrow recall, and Ctrl-J multi-line prompts |
You will need Rust (stable, 1.75+) and one local-inference backend running.
git clone https://github.com/morganlinton/SmallHarness.git
cd SmallHarness
cp .env.example .env
cargo run --release

Build a standalone binary with cargo build --release — it lands at
target/release/small-harness (~5 MB).
By default Small Harness talks to Ollama at http://localhost:11434/v1. To
target LM Studio, MLX, or llama.cpp instead, set BACKEND=lm-studio,
BACKEND=mlx, or BACKEND=llamacpp before running, or use /backend once
the harness is running.
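For a one-off run, the backend can also be set inline on the command line; for example, assuming LM Studio's local server is already running on its default port:

# inline override for a single run; same effect as setting BACKEND=lm-studio in .env
BACKEND=lm-studio cargo run --release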
If agent.config.json does not exist, the first run opens a short setup
wizard that writes one for you and probes the selected backend. Set
SMALL_HARNESS_NO_WIZARD=true to skip the wizard and use env/defaults only.
Pick one. Ollama is the fastest path on a fresh box:
brew install ollama
brew services start ollama
ollama pull qwen2.5-coder:7b

LM Studio (already installed), MLX, and llama.cpp are also supported. See Backends for ports and setup notes.

Then start the harness:

cargo run --release

On a fresh checkout, the setup wizard asks for backend, hardware profile,
optional model override, approval policy, and adaptive/fixed tool mode, then
writes agent.config.json. After setup, you will see the banner, a backend
probe, and a "Warming up" spinner that populates the prompt-eval cache so the
first prompt isn't slow. When the input box opens, type a question:
> what files are in src/?
/backend lm-studio switch to LM Studio
/backend llamacpp switch to llama.cpp
/setup rerun setup and rewrite agent.config.json
/profile mac-studio-32gb switch the hardware profile (changes default model)
/model list models from the current backend and pick one
/tools show enabled tools and auto/fixed selection mode
/compare run the same prompt against OpenRouter cloud
/sessions list saved JSONL sessions
/resume latest resume the newest saved session
/doctor check backend, config, rg, and session storage
/doctor --deep probe stream, usage, and tool-call capabilities
Each tool definition costs prompt-eval time on small local models. Small
Harness defaults to toolSelection: "auto", so ordinary chat sends no tool
schemas, file/code questions send read/search/list schemas, edit requests add
edit/patch schemas, and shell-ish prompts add shell when it is enabled.
The tools list is the allowed pool:
/tools auto adaptive tool selection (default)
/tools fixed always send every enabled tool schema
/tools file_read,grep,list_dir
/tools auto file_read,grep,list_dir
Or set persistently in agent.config.json:
{ "tools": ["file_read", "file_edit", "grep", "list_dir"] }| Backend | Default URL | API style | Best for |
|---|---|---|---|
ollama |
http://localhost:11434/v1 |
OpenAI-compatible | Easiest setup; mature tool-call templates; CLI model management |
lm-studio |
http://localhost:1234/v1 |
OpenAI-compatible | GUI model browser; explicit load/unload controls |
mlx |
http://localhost:8080/v1 |
OpenAI-compatible (via mlx_lm.server) |
Fastest inference on Apple Silicon |
llamacpp |
http://localhost:8080/v1 |
OpenAI-compatible (via llama-server) |
Direct GGUF serving; fastest path if you already use llama.cpp |
openrouter |
https://openrouter.ai/api/v1 |
OpenAI-compatible | Cloud A/B comparison; access to larger frontier models |
Override URLs with OLLAMA_BASE_URL, LM_STUDIO_BASE_URL, MLX_BASE_URL,
or LLAMACPP_BASE_URL. openrouter requires OPENROUTER_API_KEY.
llamacpp uses LLAMACPP_API_KEY only if your llama-server enforces one.
The official @openrouter/agent SDK speaks OpenRouter's newer /responses
endpoint. Small Harness uses a hand-rolled reqwest + SSE client pointed at
each backend's baseURL because /v1/chat/completions is the common shape
across the supported local servers and OpenRouter cloud, even when a backend
also exposes newer endpoints.
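That shared shape is the standard OpenAI-style chat-completions payload. As a rough sketch of what goes over the wire (shown as a hand-rolled curl against the default Ollama endpoint with its default model, not how the harness itself is invoked):

# -N turns off curl buffering so the SSE chunks print as they stream
curl -N -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": true
      }'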
| Tool | Default | Approval | What it does |
|---|---|---|---|
| `apply_patch` | off | yes | Validate and apply a unified diff with `git apply --check` |
| `file_read` | on | no* | Read a file (text or image base64) with optional offset/limit |
| `file_edit` | on | yes | Search-and-replace edits with unique-match validation, returns unified diff |
| `grep` | on | no | Regex search file contents (uses ripgrep) |
| `list_dir` | on | no* | List directory entries, alphabetical, capped at 500 |
| `file_write` | off | yes | Write/create a file (overwrites) |
| `glob` | off | no* | Find files by glob pattern |
| `shell` | off | yes | Run a shell command, output capped at 256 KB |
* Read-only tools prompt when `outsideWorkspace` is `prompt` and the request
targets a path outside `workspaceRoot`.
Toggle the active set per session with /tools, per shell with the
AGENT_TOOLS env var, or persistently in agent.config.json.
| Policy | Behavior |
|---|---|
| `always` (default) | Every call to a mutating tool prompts you |
| `dangerous-only` | Only shell calls matching `rm`, `sudo`, `chmod`, `dd`, `mkfs`, etc. prompt; safer commands run silently |
| `never` | No prompts (use only when you trust the model) |
At each prompt you can choose [y]es, [n]o, [a]lways for this tool, or
[s]ession-allow this exact call. The session cache resets on /new.
| Command | Description |
|---|---|
| `/help` | List available commands |
| `/setup` | Run the setup wizard, write `agent.config.json`, probe the backend, and apply the new config |
| `/new` | Start a fresh conversation |
| `/clear` | Clear the screen |
| `/config` | Show resolved backend, model, workspace, history, display, and context config |
| `/session` | Show backend, model, approval policy, session path, message count, total tokens |
| `/sessions` | List saved sessions under `.sessions/` |
| `/resume latest\|<id>` | Resume a saved session |
| `/export current\|<id> [markdown\|json] [path]` | Export a session transcript |
| `/backend [name]` | Switch backend (`ollama`, `lm-studio`, `mlx`, `llamacpp`, `openrouter`) |
| `/profile [name]` | Switch hardware profile (`mac-mini-16gb`, `mac-studio-32gb`) |
| `/model [id]` | List models from the current backend and pick one, or set directly |
| `/tools [auto\|fixed\|list]` | Show enabled tools, switch adaptive mode, or set the enabled pool: `/tools auto file_read,grep,list_dir` |
| `/compare [model]` | Re-send the last user message to OpenRouter cloud for A/B |
| `/context [maxMessages=N maxBytes=N]` | Show prompt budget, active adaptive tools, byte/token estimate, and context limits |
| `/compact [keep]` | Summarize older turns into a compact continuation session |
| `/doctor` | Check backend reachability, model list, `rg`, config, and session storage |
| `/doctor --deep [all]` | Probe OpenAI-compatible streaming, usage chunks, native tool calls, and inline JSON fallback, then save JSON/Markdown reports under `.sessions/doctor/` |
| `/bench [model]` | Measure warmup, first-token, total latency, and output rate |
| `/eval [prompt-file] [models]` | Run saved prompts against one or more models with tools off/on |
| `exit` | Quit |
/doctor --deep checks the active backend. Add all to probe every configured
backend with short timeouts; unreachable backends show as failed rows in the
capability table.
The profile drives the default model per backend. You can always override
with AGENT_MODEL or /model.
| Profile | Default Ollama model | Default LM Studio model | Default MLX model | Default llama.cpp model |
|---|---|---|---|---|
| `mac-mini-16gb` | `qwen2.5-coder:7b` | `qwen2.5-coder-7b-instruct` | `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` | `gpt-3.5-turbo` |
| `mac-studio-32gb` | `qwen2.5-coder:14b` | `qwen2.5-coder-14b-instruct` | `mlx-community/Qwen2.5-Coder-14B-Instruct-4bit` | `gpt-3.5-turbo` |
The OpenRouter cloud default for both profiles is
qwen/qwen-2.5-coder-32b-instruct. The llama.cpp default mirrors the
llama-server OpenAI-compatible examples; use /model or start
llama-server with --alias if you want the loaded GGUF to advertise a
specific model id.
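As one illustration (the GGUF path and alias are placeholders; the flags are standard llama-server options), a launch that advertises a specific model id could look like:

llama-server -m ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 --jinja \
  --alias qwen2.5-coder-7b-instruct

The alias is the id the server then reports from /v1/models, so /model can pick it by name rather than the generic default.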
llama.cpp and llama.cpp-derived engines cache the prompt-eval result for any
prefix they have already seen. At startup, Small Harness sends a tiny
chat-completions request with the full system prompt + tool definitions and
max_tokens: 1.
That populates the cache, so your first real prompt only has to evaluate
the new user tokens — typically dropping first-prompt latency from ~12 s to
~2 s on a 7B q4 model.
Disable with WARMUP=false if you want a faster startup at the cost of a
slow first prompt.
The cache becomes stale when you change /backend, /model, or /tools.
The next prompt after a switch will pay the prompt-eval cost again.
# Backend selection: ollama (default), lm-studio, mlx, llamacpp, openrouter
BACKEND=ollama
# Hardware profile: mac-mini-16gb (default) or mac-studio-32gb
PROFILE=mac-mini-16gb
# Override the model for the chosen backend
AGENT_MODEL=qwen2.5-coder:14b
# Per-backend endpoint overrides
OLLAMA_BASE_URL=http://localhost:11434/v1
LM_STUDIO_BASE_URL=http://localhost:1234/v1
MLX_BASE_URL=http://localhost:8080/v1
LLAMACPP_BASE_URL=http://localhost:8080/v1
# Optional if llama-server was started with API-key enforcement
LLAMACPP_API_KEY=sk-no-key-required
# Required when BACKEND=openrouter or you want /compare
OPENROUTER_API_KEY=sk-or-...
# Approval policy: always (default) | never | dangerous-only
APPROVAL_POLICY=always
# Active tools, comma-separated. Default: file_read,file_edit,grep,list_dir
AGENT_TOOLS=file_read,file_edit,grep,list_dir
# Tool schema selection: auto (default) or fixed
AGENT_TOOL_SELECTION=auto
# Pre-warm the model at startup (default: on)
WARMUP=true
# Skip first-run setup and rely on env vars / built-in defaults
SMALL_HARNESS_NO_WIZARD=false
# Maximum agent steps per turn
AGENT_MAX_STEPS=20
# Workspace safety: prompt (default), deny, allow
WORKSPACE_ROOT=/path/to/project
OUTSIDE_WORKSPACE=prompt
# Context/history tuning
AGENT_CONTEXT_MAX_MESSAGES=40
AGENT_CONTEXT_MAX_BYTES=262144
AGENT_HISTORY=true
AGENT_HISTORY_MAX_ENTRIES=200

For project-level defaults, run /setup or drop a JSON file in the repo root.
Anything you put here can be overridden by env vars or slash commands at
runtime.
{
"backend": "ollama",
"profile": "mac-mini-16gb",
"approvalPolicy": "dangerous-only",
"tools": ["file_read", "file_edit", "grep", "list_dir"],
"toolSelection": "auto",
"maxSteps": 20,
"workspaceRoot": "/path/to/project",
"outsideWorkspace": "prompt",
"context": {
"maxMessages": 40,
"maxBytes": 262144
},
"history": {
"enabled": true,
"maxEntries": 200
},
"profiles": {
"mac-studio-fast": {
"ollama": "qwen2.5-coder:14b",
"llamacpp": "gpt-3.5-turbo",
"openrouter": "qwen/qwen-2.5-coder-32b-instruct"
}
},
"display": {
"toolDisplay": "grouped",
"inputStyle": "bordered",
"loaderStyle": "spinner",
"loaderText": "Thinking",
"showBanner": true
}
}

Configuration is resolved in this order, highest precedence first:

- Slash command overrides at runtime
- Process environment variables (`BACKEND`, `PROFILE`, `AGENT_MODEL`, `AGENT_TOOLS`, …)
- `.env.local`, then `.env`
- `agent.config.json` in the working directory
- Built-in defaults
+-------------------------+
| main.rs |
| banner / input loop / |
| warmup / approval |
+------------+------------+
|
v
+--------------+ +-------------------------+ +-------------------+
| config.rs |--->| agent.rs |<-->| tools/*.rs |
| dotenv+JSON | | chat/completions loop | | serde-typed, |
| + profiles | | streaming + tool calls | | approval-gated |
+--------------+ +------------+------------+ +-------------------+
|
v
+-------------------------+
| backends.rs |
| Ollama / LM Studio / |
| MLX / llama.cpp / |
| OpenRouter |
+-------------------------+
|
v
+-------------------------+
| session.rs |
| JSONL sessions/export |
+-------------------------+
cargo check # type-check without producing a binary
cargo run # debug build + run (faster compile, slower runtime)
cargo run --release # optimized build + run
cargo build --release # produce target/release/small-harness

Project layout:
src/
main.rs entry — input loop, loader, approval wiring, warmup
agent.rs chat/completions runner with tool calls + streaming
backends.rs Ollama / LM Studio / MLX / llama.cpp / OpenRouter endpoints + defaults
config.rs dotenv + agent.config.json loader, workspace/context/history config
approval.rs y/n/always/session-allow prompt with diff previews
session.rs JSONL conversation log, listing, resume, export helpers
warmup.rs pre-warm the prompt-eval cache at startup
commands.rs slash commands for sessions, config, backends, evals, doctor, bench
renderer.rs grouped tool display
loader.rs spinner / gradient / minimal loaders
banner.rs ASCII banner + dynamic backend/profile/model line
input.rs bordered + plain readers with history and multi-line input
openai.rs wire types + SSE streaming for chat completions
tools/ apply_patch, file_read, file_write, file_edit, glob_tool, grep, list_dir, shell
Quality expectations:
- `cargo check` must pass cleanly.
- Tools that mutate filesystem state implement `require_approval` on the `Tool` trait (returning `true`, or computing it from the args for dangerous shapes — see `shell.rs`).
- New backends should expose an OpenAI-compatible `/v1/chat/completions` endpoint and add a profile-default model map in `backends.rs`.
Versioning:
- Small Harness stays on the `0.1.x` line before a larger product milestone.
- The patch number tracks the total repo commit count for the release commit. This setup release is `0.1.30`: 29 commits were already on `main`, and the release commit is expected to be commit 30.
- Release tags should use a leading `v`, for example `v0.1.30`.
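A quick sanity check of that convention (assuming a full, non-shallow clone; the tag command is illustrative, not a project script):

git rev-list --count HEAD                     # total commits up to the release commit; should equal the patch number
git tag "v0.1.$(git rev-list --count HEAD)"   # tag the release following the convention above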
The harness probes the backend at startup. If the probe fails, the named backend is not listening on the expected port. Suggestions:
- Ollama: `brew services start ollama`, or run `ollama serve` in a separate terminal. Default port 11434.
- LM Studio: open the app, go to "Local Server", click Start. Default port 1234.
- MLX: start `mlx_lm.server --port 8080` against an MLX-format model.
- llama.cpp: start `llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 8080`. Add `--jinja` when you want native OpenAI-style tool calls.
- OpenRouter: set `OPENROUTER_API_KEY` in `.env`.
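If you just need to confirm a local backend is listening at all, a direct request to its models endpoint is a quick check (shown for the default Ollama port; substitute the URL from the backends table):

curl -s http://localhost:11434/v1/models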
For backend-specific capability problems, run /doctor --deep. It exercises
/v1/models, streaming chat completions, usage chunks, a harmless tool-call
schema, and Small Harness' inline JSON fallback detector. Reports are saved to
.sessions/doctor/ for sharing or comparison.
If you change /backend, /model, /tools, or the hardware profile after
warmup, the cached prefix becomes stale and the next prompt re-evaluates
the new system prompt + tools. This is one-time per change.
Some small-model templates emit tool calls as plain content
(e.g. {"name":"shell","arguments":{...}}) instead of populating the
tool_calls field. Small Harness detects this pattern and synthesizes a
real tool call. If a particular model still misbehaves, switching to
llama3.1:8b (which has well-tested tool-call templates) usually resolves
it.
Some bilingual models (notably the qwen family) drift into Chinese on short
greetings. The system prompt now includes an explicit language directive,
but you can strengthen it further by editing SYSTEM_PROMPT in
src/config.rs.
Install Rust via rustup: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh.
Small Harness is released under the MIT License.
