An agent skill that routes summarisation, triage, and bulk-text tasks to locally-installed models (Ollama or MLX) instead of the cloud API. Keeps content on-device, preserves the agent's context window, and uses llmfit to keep the model set current.
One linear path from nothing to a first delegated call:
# 1. Install (symlinks the skill into every detected agent tool)
npx skills add IsmaelMartinez/delegate-local
# 2. Confirm at least one local model is installed and see how tiers route
bash ~/.claude/skills/delegate-local/scripts/audit-models.sh
# 3. Make your first delegated call
git diff | bash ~/.claude/skills/delegate-local/scripts/delegate.sh prose "Summarise this diff in 3 bullets."Step 2 requires Ollama (or mlx-lm on Apple Silicon) with a model pulled — see Requirements and Backends. The rest of this README covers install options, backend selection, and the routing internals.
When a task fits the "gather context once, send one prompt, return text" pattern — log triage, commit-message drafting, batch classification, structured extraction, prose rewriting, format conversion, regex generation, docstring stubbing — the agent delegates to a local model via delegate.sh instead of handling it itself. Reasoning, tool-calling, and repo-wide tasks still go to the cloud model.
The skill auto-delegates by default. Saying "delegate where it fits" or "auto-delegate" once locks that behaviour for the rest of the conversation.
Core pattern (from local-brain) — resolve a tier to the best installed model, pipe context in, get text back:
git diff HEAD~5 | bash scripts/delegate.sh prose "Summarise in 3 bullets."delegate.sh handles backend selection (Ollama or MLX via auto-probe), model resolution, metrics logging, and returns clean text with no ANSI artifacts.
- Ollama with at least one model pulled (the default backend), or
mlx-lmon Apple Silicon (see Backends below) jq(foraudit-models.sh)llmfit(optional, enables upgrade suggestions based on your hardware)
The default is DELEGATE_BACKEND=auto: on each call, delegate.sh probes localhost:8080 for an MLX server. If it responds, the call routes through MLX (Apple's Metal-native runtime — lower memory, faster prefill). Otherwise it falls back to Ollama. Non-Apple-Silicon hosts always fall through to Ollama transparently.
On Apple Silicon, MLX is the recommended backend. Install and auto-start via launchd are documented in docs/install-mlx.md. The quick version:
python3 -m venv ~/venvs/mlx-lm && ~/venvs/mlx-lm/bin/pip install mlx-lm
~/venvs/mlx-lm/bin/huggingface-cli download mlx-community/Qwen3.6-35B-A3B-8bit
~/venvs/mlx-lm/bin/mlx_lm.server --model mlx-community/Qwen3.6-35B-A3B-8bit --port 8080 &Force a specific backend with DELEGATE_BACKEND=ollama or DELEGATE_BACKEND=mlx. The metrics JSONL tags each call with a backend field so scripts/metrics-summary.sh can break down latency per backend.
Use Vercel Labs' skills CLI, which symlinks the skill into every detected agent tool (Claude Code, Codex, OpenCode, Cursor, Copilot, and many others) so updates propagate everywhere at once:
npx skills add IsmaelMartinez/delegate-localPass -g to install globally (~/<agent>/skills/) instead of per-project, --copy to make independent copies on systems without symlink support, or -a claude-code to limit to a specific agent.
When the universal install is the wrong fit (per-machine routing, MCP-only consumers, AAIF-only setups), the per-tool docs cover the specifics:
The skill is conformant with the Agent Skills standard — SKILL.md at the directory root with name and description frontmatter — so any tool that reads that format can use it. The repo is also AAIF-discoverable directly: a symlink at .agents/skills/delegate-local points at the repo root, so tools that scan the AAIF layout (Cursor, Copilot, OpenCode) find the skill without per-tool copying. For tools that do not support AAIF discovery, drop the directory into the tool's expected skills path:
git clone https://github.com/IsmaelMartinez/delegate-local
cp -r delegate-local ~/.claude/skills/ # or your tool's skills dirAfter install, run the audit from wherever the skill landed:
bash <install-path>/scripts/audit-models.shThe shipped pick-model.sh is one preference list for everyone. To override the order on a specific machine without forking the repo, drop a bash file at ~/.claude/skills/delegate-local/config.sh. pick-model.sh sources it after the shipped defaults are set, so any tier the file touches wins. Untouched tiers fall through to shipped defaults; an absent file changes nothing.
Trust note:
config.shis sourced as bash bypick-model.sh, meaning its contents execute with your environment and privileges. This is arbitrary code execution by design, similar to~/.aiderrcor~/.claude/settings.local.json. Only place aconfig.shyou wrote yourself or fully trust at that path; never paste one from an untrusted source. See SECURITY.md for the full trust model.
# ~/.claude/skills/delegate-local/config.sh
case "$tier" in
prose) prefs=("gemma4" "qwen3.6" "qwen3-next") ;;
esacscripts/init.sh writes a starter override based on what's currently installed — read-only, prints to stdout, never auto-writes:
bash <install-path>/scripts/init.sh > ~/.claude/skills/delegate-local/config.shSet DELEGATE_LOCAL_CONFIG=/some/other/path.sh to redirect the override path (useful for testing or per-project overrides).
SKILL.md— triggering description and usage patterns the agent reads.scripts/delegate.sh <tier> "<prompt>"— wrapspick-model.sh+ the backend's HTTP API (Ollama or MLX, auto-selected) withthink:falseandtemperature:0defaults. Appends one JSON line per call to~/.claude/skills/delegate-local/metrics.jsonl. Use this instead of bareollama runor hand-rolledcurlcalls.scripts/pick-model.sh <tier>— resolves a tier to the best installed model via substring preference lists. Tiers arecode,prose,reasoning, andlong-context(active), plusvision,embedding,premium-general, andreasoning-vision(scaffolded). Edit this file (not the skill body) when your installed set changes.scripts/audit-models.sh— prints installed models, tier routing, and llmfit-driven upgrade suggestions filtered to first-party providers. Read-only; never pulls.scripts/metrics-summary.sh— reads the metrics JSONL and prints volume per tier, p50/p95 latency, total tokens-avoided, and top models by frequency. Read-only.tests/— unit tests for every script. Run withbash tests/run-tests.sh(and the per-scriptbash tests/test-*.sh).mcp/— optional Python MCP server that exposespick_model,audit_models, andlist_tiersto non-Claude tools (Codex, OpenCode, Cursor, custom MCP clients). Thin wrapper over the bash scripts, not a reimplementation. Seemcp/README.mdfor install and config snippets.docs/observability/— opt-in OTLP exporter. SetDELEGATE_OTEL_ENDPOINT=<url>and everydelegate.shcall POSTs an OTLP span (off by default, zero overhead when unset). Content is redacted by default; only metadata (tier, model, recipe, char counts, durations, verdict) travels to the collector. Three backends documented: Grafana Cloud, Langfuse, and Phoenix. Seedocs/otel-schema.mdfor the wire format.
Three scripts gate every PR via GitHub Actions:
scripts/validate-frontmatter.sh SKILL.md— asserts the SKILL.md frontmatter has required fields, thenamematches the directory, andnamematches the Claude Skills regex.scripts/validate-skill-content.sh SKILL.md— scans for eight categories of dangerous content (auth-disable, permissive flags, credential exfiltration, base64 obfuscation, zero-width / bidi unicode, broad tool grants, unresolved merge markers, external URLs). Justified false positives go in.content-check-allow.scripts/eval-skill-triggers.sh— validatesevals/eval-set.jsonshape by default; with--ollama [model]runs each tagged query through a local Ollama model (free, on-device; defaults topick-model.sh codewhich baselines at 1.000 / 1.000 against the current eval set on the reference host); with--github-models [model]runs against GitHub Models (free up to the per-model rate-limit tier; defaults toopenai/gpt-4o-miniwhich baselines at 0.900 / 1.000); with--apiandANTHROPIC_API_KEYset, runs against Claude — kept for the rare case Claude-grade scoring is wanted. All three modes use only the SKILL.md frontmatter description as the trigger surface and assert recall + negative-precision thresholds.
The --ollama mode is the recommended local pre-merge gate (10–30 s on a mid-tier machine, dogfoods the project's own routing). The --github-models mode is the recommended CI gate — uses the auto-provisioned GITHUB_TOKEN so there is no secret to configure; the workflow declares permissions: models: read to grant scope. The --api mode is opt-in via the ANTHROPIC_API_KEY repo secret (Settings → Secrets and variables → Actions); without the secret the CI step is skipped, not failed.
Recipes evolve from real session feedback. The loop is end-to-end on-device until you decide to share a finding:
delegate.sh run metrics.jsonl delegate-feedback.sh
↓ (append-only, miss "<reason>"
appends one row gitignored) ↓
per call appends one row
↓
matches reason
against historical
MISS rows (Jaccard
over content tokens)
↓
if N≥3 similars in
last 30d → nudge
prints draft gh
issue command
↓
you decide whether to
file a prompt-pattern
issue → maintainer
graduates it into
prompts/<new>.md
The single-machine metrics JSONL has no scheduled job behind it; the nudge is the runtime signal. After the third similar MISS in the rolling window, delegate-feedback.sh prints the matched reasons and a draft gh issue create command pre-targeted at the prompt-pattern label. The nudge is advisory — it never opens the issue on its own — so each filing stays a deliberate call. Silence one invocation with DELEGATE_FEEDBACK_NO_NUDGE=1; tune the trigger via DELEGATE_FEEDBACK_NUDGE_AT (default 3), DELEGATE_FEEDBACK_NUDGE_WINDOW_DAYS (default 30), and DELEGATE_FEEDBACK_SIMILAR_THRESHOLD (default 0.4 Jaccard over stopword-stripped content tokens).
scripts/audit-metrics.sh is the on-demand counterpart to that runtime nudge — the same matcher applied many-vs-many across the whole JSONL instead of one-vs-many at MISS time. Run it for periodic review, or to scan a cross-machine JSONL the per-MISS nudge would never see (the JSONL is gitignored, so each host has its own). The script reads DELEGATE_METRICS_FILE and honours the same DELEGATE_FEEDBACK_NUDGE_AT / _WINDOW_DAYS / _SIMILAR_THRESHOLD envs, prints one draft gh issue create command per recurring bucket, and never writes to the JSONL itself.
A prompt-pattern issue captures the task shape, tier and resolved model, verbatim prompt and model output, and (when known) the prompt that turned the MISS into a HIT. prompts/README.md documents how the maintainer graduates an issue into a prompts/<new>.md recipe paired with an evals/eval-set.json positive — closing the loop empirically rather than evaporating after one conversation.
Worth being explicit about this because the skill could easily be oversold.
The delegated call is where the savings live — the local model writes the summary, classification, or patch instead of Claude. That's real and measurable: the metrics rollup (scripts/metrics-summary.sh) reports a "tokens avoided" headline computed from real Ollama prompt_eval_count + eval_count counts. The 2026-05-04 v8 probe's "~250× cheaper than Opus" number is also real for the specific workload it measured — 18 minimal-patch code cells scored by pytest.
But a realistic delegation flow costs more Anthropic tokens than just "the delegated call minus zero." In a typical turn, Claude still spends tokens to:
- read the user's request and decide a local model fits,
- frame the delegation prompt,
- read the local model's response back,
- verify the response against the actual files (for correctness-critical work — see SKILL.md's Discipline subsection on running the test when delegating code, and the gather-delegate-verify pattern throughout).
For small tasks — a 3-bullet diff summary, a one-line commit message — the verification overhead can eat most of the headline saving. The shape where this skill actually pays off is bulk or repeated work: triage 50 TODOs, classify 40 findings against an allowlist, draft release notes from a 200-line changelog. There the per-item Anthropic cost is dominated by the delegated generation, which is the part that moves to local.
Rough annual estimates (char-count tokens, ~1.5 KTok per round-trip that would otherwise go to the API):
| Usage profile | Delegations/year | Haiku-equiv save | Sonnet-equiv save | Opus-equiv save |
|---|---|---|---|---|
| Subscription user (Claude Max/Pro) | any | $0 (marginal) | $0 | $0 |
| Casual API user (~10/hour × 20 h/wk) | ~10k | $25–40 | $75–115 | $375–560 |
| Heavy/team (~500/day × every day) | ~180k | $450–700 | $1.4k–2k | $6.8k–10k |
The subscription row is the honest one for individual developers: inside a Claude Max or Pro plan the marginal API cost of a delegation is zero, so the direct-dollar saving is zero. The real benefit in that case is different: context-window preservation (those tokens don't bloat Claude's active context, so long conversations stay focused) and privacy (sensitive logs, configs, and credentials never leave the machine).
More capable local models will shift these numbers but probably not by an order of magnitude. "250× cheaper than Opus" is a headline for one measured shape, not a general claim.
The skill intentionally avoids frameworks. Local models are good summarisers and weak agents; delegation is a shell pipe, not an orchestration layer. The pick-model.sh preference lists are the single point of truth for routing — no hardcoded model names in the skill body.
audit-models.sh cross-checks llmfit's installed flag against ollama list because llmfit tracks its own HuggingFace GGUF cache rather than Ollama's model store. It filters suggestions to Alibaba/Google/Meta/Microsoft/DeepSeek/Mistral/Zhipu so third-party fine-tunes that Ollama won't have under the same name don't pollute the output.
This skill sits at the intersection of three personal projects, and is observed by a fourth.
local-brain is the source of the framing this skill operationalises. The core finding — local models are strong summarisers and weak agents, so delegation is a shell pipe rather than an orchestration layer — comes directly from that work, and is why this skill is implemented as bash scripts rather than a framework.
ai-model-advisor supplies the tier classification (code / prose / reasoning / long-context) and the "smallest model sufficient" environmental philosophy that pick-model.sh encodes. When you change the preference order in that script, the rationale you are applying is the one ai-model-advisor argues for: bigger is not better when a 9GB model handles the prompt in half the time.
llmfit is an optional dependency that enables hardware-aware upgrade suggestions in audit-models.sh. When llmfit is not on PATH the audit prints routing only and skips the upgrade-check section with a hint. When it is present, the audit feeds llmfit's hardware-scored recommendations through a first-party-provider filter and surfaces upgrades that beat the installed leader by 3+ points. Patterns the audit script learns about Ollama-vs-HuggingFace name mappings (hf_stem normalisation) flow back to llmfit when worth generalising.
repo-butler tracks repo health across the portfolio. No integration work is needed here — repo-butler picks up new repos automatically once they exist on GitHub, and this one is now visible to it.
A monthly reminder to re-run scripts/audit-models.sh is automated via .github/workflows/monthly-audit-reminder.yml. The workflow opens a tracking issue on the 1st of each month (idempotent — skips when one is already open) because the audit needs a local ollama list and can't run on the hosted runner; workflow_dispatch is the manual escape hatch.
MIT — reuse freely.