Hard enforcement for local LLMs running on the Pi coding agent.
Local models (35B and under) spiral, forget, and write 800-line files in one shot. PiForge physically prevents that — at the API boundary, not the prompt level — and gives the model an external brain via `.think/` files that survive context compression.
Tested with qwen3.6-35b-a3b at Q2_K_XL quantization via LM Studio on macOS. Yes — a 2-bit quantized model doing structured multi-file coding, codebase distillation, and tool-call workflows. The guard stack makes that possible.
| Extension | What it enforces | Default |
|---|---|---|
| `incremental-guard.ts` | Rejects write/edit calls > 100 lines or 6000 chars — forces skeleton → edit workflow | on |
| `thinking-guard.ts` | Injects correction when thinking block > 2000 chars — stops reasoning spirals | on |
| `context-monitor.ts` | Steers model to write state files at 65% context, urgent at 80% | on |
| `analysis-guard.ts` | Forces findings to `.think/step-NNN.md` when response > 1000 chars with no file write | on |
| `state-guard.ts` | Blocks source reads until `_state.md` is read; forces updates every 5 turns | on |
| `first-prompt.ts` | Appends "plan in steps, implement one at a time" to first prompt — preventive, zero context overhead | on |
| `plan-clarify.ts` | Intercepts `_plan.md` writes — forces model to ask ≤3 clarifying questions before any code | off |
| `knowledge-injector.ts` | Isolated LLM call selects relevant `~/.pi/knowledge/` files, saves manifest, auto re-injects after compaction. `/forget` to remove. | off |
These are hard — the model cannot bypass them. incremental-guard and knowledge-injector physically reject tool calls. The others inject steering messages before the next LLM call.
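For a concrete picture of the hard path, here is a minimal sketch of the incremental-guard check. The hook signature and tool-call shape are hypothetical stand-ins for Pi's extension API; the steering guards follow the same pattern but return a message to inject instead of an error.

```typescript
// Minimal sketch of the incremental-guard idea. The ToolCall shape and
// the veto convention are illustrative, not Pi's real extension API.
const MAX_LINES = 100;
const MAX_CHARS = 6000;

interface ToolCall {
  name: string; // e.g. "write" or "edit"
  args: { path: string; content: string };
}

// Returning an error string vetoes the call at the API boundary; the
// model receives it as a failed tool result and must retry smaller.
function guardToolCall(call: ToolCall): string | null {
  if (call.name !== "write" && call.name !== "edit") return null;
  const lines = call.args.content.split("\n").length;
  const chars = call.args.content.length;
  if (lines > MAX_LINES || chars > MAX_CHARS) {
    return (
      `Rejected: ${lines} lines / ${chars} chars exceeds the ` +
      `${MAX_LINES}-line / ${MAX_CHARS}-char limit. Write a skeleton ` +
      `first, then fill it in with small incremental edits.`
    );
  }
  return null; // within limits: the call goes through untouched
}
```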
plan-clarify and knowledge-injector are disabled by default — enable per session with `/piforge enable <name>`. Use `/piforge` to see status.
A local model with 50k context can't hold a real codebase. Reading files one by one is slow, burns context, and the model forgets file #1 by the time it reads file #10. Distill solves this by building compressed versions of the entire codebase at multiple zoom levels — like Google Maps for your code.
The idea: You distill your codebase once. This creates three levels of compressed summaries, all mirroring the original folder structure:
```
Source (100%)  →  L1 (~50%)  →  L2 (~25%)  →  L3 (~12%)
  full code       key logic     signatures    one-liners
```
When Pi needs to understand the codebase, it doesn't read source files. It queries the right zoom level:
- L3 — "What modules exist? What's the architecture?" — fits in a few hundred tokens
- L2 — "How does the auth system work?" — function signatures, key relationships
- L1 — "Show me the output pipeline logic" — detailed summaries with key code preserved
Pi zooms in only when needed. Most questions are answered at L2/L3 without ever reading source. When Pi does need the actual code, it knows exactly which file to open because L2 already told it where things live.
How it works: Crawls the directory, builds an import graph, topologically sorts files, and processes each file via isolated sub-Pi calls — the main session LLM stays idle and clean. The distilled knowledge persists across sessions.
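As an illustration of the ordering step, a depth-first topological sort over the import graph might look like this. A sketch only: building the `imports` map and issuing the sub-Pi calls are assumed to happen elsewhere.

```typescript
// Sketch of the distill ordering step: process dependencies before the
// files that import them, so each summary can lean on prior summaries.
// `imports` maps a file to the files it imports.
function topoSortFiles(imports: Map<string, string[]>): string[] {
  const order: string[] = [];
  const seen = new Set<string>();

  function visit(file: string): void {
    if (seen.has(file)) return; // already ordered, or part of a cycle
    seen.add(file);
    for (const dep of imports.get(file) ?? []) visit(dep);
    order.push(file); // emitted only after all its dependencies
  }

  for (const file of imports.keys()) visit(file);
  return order;
}

// Example: auth.ts imports db.ts, so db.ts is summarized first.
topoSortFiles(new Map([
  ["src/auth.ts", ["src/db.ts"]],
  ["src/db.ts", []],
])); // → ["src/db.ts", "src/auth.ts"]
```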
| Extension | What it does | Default |
|---|---|---|
| `distill.ts` | `/distill` command + `distill_codebase` LLM-callable tool | on |
| `distill-query.ts` | `/l1`, `/l2`, `/l3` query commands + `/distill-status` | on |
| `explore.ts` | `/explore` + `explore_codebase` tool (superseded by distill-query) | off |
| `distill-awareness.ts` | Session-start context injection (superseded by distill-query) | off |
Additional features:
- Purpose-driven notes: `--purpose "how does auth work?"` takes notes on each file during distillation, then synthesizes a comprehensive answer
- LLM-callable tool: Pi can call `distill_codebase` autonomously — no slash command needed
- Single file support: distill one large file with automatic chunking
- Auto-detect level: point at `.think/distill/L1/` and it auto-outputs L2
- Resume support: `--resume` continues interrupted distillation
```
/distill [path]                       # distill directory (default: .)
/distill [path] --purpose "question"  # distill + take notes on question
/distill --resume                     # resume interrupted run
/distill --level 2                    # compress L1 → L2
/distill [path] --ratio 30            # aggressive compression (30%)
/l1 "how does auth work?"             # query L1 summaries directly
/l2 "what modules exist?"             # query L2 summaries directly
/l3 "high-level architecture?"        # query L3 summaries directly
/distill-status                       # show coverage per level
```
Output structure:
```
.think/distill/
├── manifest.json        ← state: files, progress per level, config
├── distill.log          ← append-only log
├── L1/                  ← mirrors source folder structure, ~50% of source
│   └── src/
│       └── auth.ts.md
├── L2/                  ← same structure, ~25% of source
│   └── src/
│       └── auth.ts.md
├── notes/               ← purpose-driven findings (optional)
│   ├── auth-notes.md
│   └── auth-notes-answer.md
└── tmp/                 ← prompt files (auto-cleaned)
```
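The manifest schema isn't spelled out here; as a rough mental model, its shape is something like the hypothetical TypeScript interface below (all field names invented for illustration).

```typescript
// Hypothetical shape of .think/distill/manifest.json. The README only
// states that it tracks files, per-level progress, and config; the
// field names here are illustrative, not the actual schema.
interface DistillManifest {
  config: {
    ratio: number;    // target compression, e.g. 50 for L1
    purpose?: string; // set when --purpose was given
  };
  files: string[];    // source files in topologically sorted order
  progress: Partial<Record<"L1" | "L2" | "L3", string[]>>; // completed files per level
}
```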
| Extension | What it does | Default |
|---|---|---|
| `session-manager.ts` | Auto-creates isolated `.think/` per Pi terminal instance via symlinks | on |
Every time you open a new Pi terminal, session-manager creates a fresh directory under `.think-sessions/` and points the `.think/` symlink to it. The model always writes to `.think/` — same hardcoded path, zero tokens wasted on session management.
```
.think-sessions/
  session-001/    ← first Pi tab's state
  session-002/    ← second Pi tab's state
  session-003/    ← third Pi tab's state
.think/ → .think-sessions/session-003/   ← symlink to active session
```
If `.think/` already exists as a real directory (from before the extension), it gets migrated automatically into session-001.
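A minimal sketch of that flow using Node's fs API, with locking and error handling elided (the function name is illustrative):

```typescript
import * as fs from "node:fs";

// Sketch of the session-manager flow: one fresh state directory per
// Pi instance, with .think/ as a symlink to the active one.
function activateNewSession(root = "."): string {
  const sessionsDir = `${root}/.think-sessions`;
  const link = `${root}/.think`;
  fs.mkdirSync(sessionsDir, { recursive: true });

  // Migrate a pre-existing real .think/ directory into session-001.
  if (fs.existsSync(link) && !fs.lstatSync(link).isSymbolicLink()) {
    fs.renameSync(link, `${sessionsDir}/session-001`);
  }

  // Next session number = number of existing sessions + 1.
  const n = fs.readdirSync(sessionsDir).length + 1;
  const session = `${sessionsDir}/session-${String(n).padStart(3, "0")}`;
  fs.mkdirSync(session);

  // Re-point the symlink; the model keeps writing to the same path.
  try { fs.unlinkSync(link); } catch { /* no previous link */ }
  fs.symlinkSync(session, link, "dir");
  return session;
}
```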
Commands: `/sessions` (list all), `/resume` (list + pick), `/resume session-001` (switch directly — injects a steer to read `_state.md`)
| Extension | What it does | Default |
|---|---|---|
| `purpose-anchor.ts` | Captures session purpose from first prompt, re-injects purpose + state after compaction | on |
When context gets compacted, Pi can lose track of the original goal. purpose-anchor solves this:
- Saves the first user prompt to `.think/_purpose.md`
- Hooks into Pi's `session_compact` event
- After compaction, steers Pi to re-read `.think/_state.md` and `_summary.md`
- Pi re-orients and continues without drift
Commands: `/purpose` (view/set), `/purpose-clear` (reset)
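In sketch form, with hook names and the injectSteer callback as illustrative stand-ins for Pi's actual extension API:

```typescript
import * as fs from "node:fs";

// Sketch of the purpose-anchor pattern. Hook names are hypothetical;
// the README only confirms Pi emits a session_compact event.
const PURPOSE_FILE = ".think/_purpose.md";

export function onFirstPrompt(prompt: string): void {
  // Capture the goal once, before any compaction can erase it.
  if (!fs.existsSync(PURPOSE_FILE)) {
    fs.writeFileSync(PURPOSE_FILE, `# Session purpose\n\n${prompt}\n`);
  }
}

export function onSessionCompact(injectSteer: (msg: string) => void): void {
  const purpose = fs.existsSync(PURPOSE_FILE)
    ? fs.readFileSync(PURPOSE_FILE, "utf8")
    : "(no purpose recorded)";
  // Steer the model back to its external brain after compaction.
  injectSteer(
    `Context was compacted. Original purpose:\n${purpose}\n` +
    `Re-read .think/_state.md and .think/_summary.md before continuing.`
  );
}
```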
`incremental-codegen` — a `SKILL.md` that teaches the model the skeleton → edit workflow. Works alongside the hard guards.
`knowledge/` — inference-time context injection with zero context pollution.
On turn 1, knowledge-injector makes an isolated LLM call using Pi's own model and endpoint. It passes the user's prompt + the knowledge filenames and asks "which are relevant?". The selection reasoning happens in that isolated call — it never appears in Pi's conversation history. Only the selected file content gets injected as a steer.
This means: smart semantic selection (the LLM knows the task), zero reasoning trace in context.
```
user prompt → isolated call → selects files → injects content only → Pi's main LLM call
```
Selected filenames are saved to `.think/_knowledge-manifest.md`. After compaction or session restart, the extension reads the manifest, rebuilds the content from source files, and re-injects automatically — zero LLM cost, no re-selection needed. Use `/forget <name>` to remove knowledge mid-session.
Code writes are blocked until `.think/_knowledge.md` is written — proof the model absorbed the knowledge.
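The selection call itself can be a single request to the same OpenAI-compatible endpoint Pi already uses. A sketch assuming LM Studio's default server on :1234, with the model name and prompt wording as placeholders:

```typescript
// Sketch of the isolated selection call against an OpenAI-compatible
// server. The endpoint, model name, and prompt are illustrative.
async function selectKnowledgeFiles(
  userPrompt: string,
  filenames: string[],
): Promise<string[]> {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "local-model",
      messages: [{
        role: "user",
        content:
          `Task: ${userPrompt}\n` +
          `Available knowledge files: ${filenames.join(", ")}\n` +
          `Reply with only the relevant filenames, comma-separated, or "none".`,
      }],
    }),
  });
  const data = await res.json();
  const reply: string = data.choices[0].message.content;
  // The reasoning stays in this throwaway call; only the contents of
  // the selected files are later injected into Pi's real conversation.
  return filenames.filter((name) => reply.includes(name));
}
```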
Included samples:
- `svelte5-gotchas.md` — Svelte 5 runes failure patterns
- `astro-gotchas.md` — Astro islands, client directives, frontmatter pitfalls
Add your own — name by tech, keep under 500 tokens, failures only:
```
~/.pi/knowledge/
├── astro-gotchas.md
├── svelte5-gotchas.md
├── react-hooks.md
└── ...
```
`project-template/AGENTS.md` — drop into any project. Tells the model to use the `.think/` external-brain workflow: scan the knowledge folder at session start, read `_state.md` first, write one step file per turn, update state after every action.
```
git clone https://github.com/yourusername/piforge
cd piforge
bash install.sh
```

Then:
- Start LM Studio, load your model, start the server on `:1234`
- Edit `~/.pi/agent/models.json` — set the model `id` to match your LM Studio model
- Copy `project-template/AGENTS.md` into any project you work on
- Run `pi` from your project directory
On startup you should see:
```
incremental-guard active (max 100 lines / 6000 chars per write/edit)
thinking-guard active (max 2000 chars / 60 lines of thinking per turn)
context-monitor active — warn at 65%, urgent at 80% (window: XXXXX tokens)
analysis-guard active (triggers on responses >1000 chars with no file write)
session-manager: session-001 — .think/ ready
```
- Pi coding agent — `npm install -g @mariozechner/pi-coding-agent`
- LM Studio with a model loaded and server running on `:1234`
- Node.js ≥ 20
Recommended model: qwen3.6-35b-a3b at Q2_K_XL quantization (Unsloth). Runs on consumer hardware via LM Studio.
We develop and test PiForge at Q2_K_XL — the most aggressive quantization level. Results at 2-bit are already surprisingly good; at higher-precision quants they only get better.
Also tested with qwen3-coder-30b-a3b-instruct. Should work with any OpenAI-compatible local server.
Add this in LM Studio → Model → System Prompt:
```
CRITICAL OUTPUT RULE: You MUST NEVER write more than 2000 tokens in a single tool call.

When generating a new file:
- First call: write ONLY the <head> and <style> section
- Second call: use bash to append the <body> HTML: cat >> file.html << 'CHUNK'
- Third call: use bash to append the <script> section
- NEVER put an entire HTML file in one write call

When the file would be large, ALWAYS use multiple bash append calls.
DO NOT OVERTHINK. Short thinking is better than long thinking.
```
Note: the Pi `incremental-guard` extension enforces this at the API layer regardless — the system prompt is a soft nudge on top.
| Parameter | Value | Notes |
|---|---|---|
| Temperature | 0.58 | Focused but not robotic |
| Response length limit | 2000 tokens | Backstop — guards are the real enforcement |
| Top-K sampling | 30 | Narrows token selection |
| Repeat penalty | 1.1 | Mild reduction of token-level loops |
| Top-P sampling | 0.95 | Standard nucleus sampling |
| Min-P sampling | 0.08 | Cuts low-probability tail tokens |
The response length limit is not always respected by local models — treat it as a last-resort backstop, not primary enforcement. The guard stack handles the real enforcement.
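For reference, the same settings as an OpenAI-style request fragment. Keep in mind that top_k, repeat_penalty, and min_p are llama.cpp-style extras rather than standard OpenAI fields; whether a given server honors them over the API varies, so setting them in the LM Studio UI is the safer route.

```typescript
// The recommended settings expressed as a request-body fragment.
// Field names follow llama.cpp-style servers; LM Studio normally takes
// these from the model's settings UI instead of per-request options.
const samplingConfig = {
  temperature: 0.58,
  max_tokens: 2000,    // soft backstop only; the guard stack is the real limit
  top_p: 0.95,
  top_k: 30,
  repeat_penalty: 1.1,
  min_p: 0.08,
};
```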
Cloud models (GPT-4, Claude, Gemini) self-regulate well enough that you don't need enforcement. Local 35B models don't — they ignore prompt rules, spiral in reasoning loops, and produce truncated garbage when they try to write large files.
The existing local LLM tooling (Cline, Roo, etc.) is designed for cloud models and just pointed at local endpoints. PiForge is built specifically for the constraints of local inference:
- Hard limits at the API layer, not suggestions in a prompt
- External memory via `.think/` files — the model writes everything to disk instead of holding it in context
- Distillation — build a knowledge base from a codebase once, reference it across sessions without re-reading source files
A scalpel isn't better than a chainsaw because it's sharper — it's better because you're doing surgery, not cutting trees.
PiForge doesn't make a Q2 quantized model smart. It removes every decision the model is bad at, until what remains is a narrow set of small, recoverable tasks it can do reliably. The right tool constrained to the right task performs well regardless of raw capability.
See PI-SETUP.md for the complete reference — every config option, tuning guide, benchmark results, and troubleshooting section.
```
piforge/
├── README.md
├── install.sh                    ← run this first
├── PI-SETUP.md                   ← full reference guide
├── distill-v2-plan.md            ← distill design document
├── distill-v2-implementation.md  ← distill implementation spec
├── extensions/
│   ├── incremental-guard.ts      ← blocks oversized write/edit calls
│   ├── thinking-guard.ts         ← stops reasoning spirals
│   ├── context-monitor.ts        ← warns before context degrades
│   ├── analysis-guard.ts         ← forces analysis to disk
│   ├── token-counter.ts          ← tracks tokens + Gemini cost comparison
│   ├── first-prompt.ts           ← injects planning instruction into first prompt only
│   ├── plan-clarify.ts           ← clarifying questions after _plan.md (off by default)
│   ├── knowledge-injector.ts     ← isolated LLM call selects knowledge files (off by default)
│   ├── state-guard.ts            ← blocks reads until _state.md read, forces updates
│   ├── piforge-manager.ts        ← /piforge command to toggle extensions
│   ├── distill.ts                ← /distill + distill_codebase tool
│   ├── distill-query.ts          ← /l1 /l2 /l3 direct level queries + /distill-status
│   ├── explore.ts                ← /explore + explore_codebase tool (off by default)
│   ├── distill-awareness.ts      ← session-start awareness (off by default)
│   ├── purpose-anchor.ts         ← anti-drift: re-injects purpose after compaction
│   └── session-manager.ts        ← per-tab .think/ isolation via symlinks
├── knowledge/
│   ├── README.md                 ← how to write knowledge files
│   ├── svelte5-gotchas.md        ← Svelte 5 runes failure patterns
│   └── astro-gotchas.md          ← Astro islands + client directives failure patterns
├── skills/
│   └── incremental-codegen/
│       └── SKILL.md              ← soft-enforcement skill
├── config/
│   ├── models.json               ← LM Studio provider config template
│   ├── settings.json             ← Pi global settings
│   └── piforge.json              ← extension toggles (plan-clarify + knowledge-injector off by default)
└── project-template/
    └── AGENTS.md                 ← drop in any project
```