A Claude Code harness that enforces discipline for shipping production agents. Built for LangGraph + FastAPI + GCP, but the core primitives (hooks, subagents, commands, context management) work with any stack.
Hone solves the gap between "Claude Code can write code" and "Claude Code reliably ships production software" — tight TDD, blocked secrets, automatic formatting, proactive context management, and a structured plan-build-check-score workflow.
cd your-project
npx create-hone@latest

That's it. The installer copies the harness files into your project, makes hooks executable, merges .gitignore entries, and tells you what to do next. It won't overwrite existing files.
- Installation
- How It Works
- The Workflow
- Context Management
- Hooks
- Subagents
- Commands
- Skills
- Eval Harness
- Customizing
- Rollout Order
- File Reference
cd your-project
npx create-hone@latest

Or with curl:

curl -fsSL https://raw.githubusercontent.com/GabrielRenno/hone/main/install.sh | bash

The installer:
- Clones Hone into a temp directory
- Copies `.claude/`, `docs/`, `evals/`, `CLAUDE.md`, and `.github/workflows/eval.yml`
- Skips files that already exist (won't overwrite your work)
- Makes hooks executable
- Merges `.gitignore` entries
- Cleans up the temp directory
If you prefer to do it yourself:
git clone https://github.com/GabrielRenno/hone.git /tmp/hone
cp -r /tmp/hone/.claude .
cp -r /tmp/hone/docs .
cp -r /tmp/hone/evals .
cp /tmp/hone/CLAUDE.md .
mkdir -p .github/workflows && cp /tmp/hone/.github/workflows/eval.yml .github/workflows/
chmod +x .claude/hooks/*.sh
rm -rf /tmp/hone

# 1. Edit CLAUDE.md: replace <project-name>, trim the stack section to match yours
vi CLAUDE.md
# 2. Verify toolchain
python3 --version && ruff --version && mypy --version && pytest --version
# 3. Run a smoke test — open Claude Code and type:
# /plan add a /healthz endpoint that returns the current server time
claude

The PRD interview should kick off, ask questions, and write docs/plans/healthz.md. Then run /build healthz: the test-writer subagent writes failing tests, the main agent implements, hooks format and typecheck, and the Stop hook blocks if anything's red.

- Claude Code installed (`npm install -g @anthropic-ai/claude-code`)
- `git`, `python3` on your PATH
- `ruff`, `mypy`, `pytest` (install as dev deps or globally)
- `gcloud`, `gh` (optional, for GCP deploys and GitHub ops)
Hone is not an application — it's a configuration layer. It consists of five primitives that shape how Claude Code behaves in your project:
CLAUDE.md is a lean file (~60 lines) that tells Claude Code your stack, architecture, non-negotiables (TDD, @traceable, Pydantic, etc.), and workflow. Claude reads this at the start of every session.
It also contains Compact Instructions — rules that tell the auto-compactor what to preserve (decisions, file paths, test state) and what to discard (raw file contents, verbose tool output). This is critical for long sessions.
Six shell scripts in .claude/hooks/ that run automatically at lifecycle points:
| Hook | When it runs | What it does |
|---|---|---|
| `block-secrets.sh` | Before any Read/Edit/Write | Blocks access to .env, .pem, .key, credential files |
| `block-protected.sh` | Before any Edit/Write | Blocks edits to migrations/, .git/, node_modules/, .venv/ |
| `format-on-write.sh` | After any Edit/Write (async) | Runs ruff format + ruff check --fix on edited Python files |
| `typecheck-on-write.sh` | After any Edit/Write | Runs mypy --strict on edited files, surfaces errors |
| `tests-must-pass.sh` | When Claude tries to stop | Blocks Claude from ending its turn if tests are failing |
| `inject-context.sh` | On every user prompt | Injects git branch, recent commits, active plan, checkpoint state |
Hooks are configured in .claude/settings.json alongside a permission denylist that blocks rm -rf, git push --force, and git reset --hard.
Three subagents in .claude/agents/ with constrained tool access:
| Subagent | Tools | Purpose |
|---|---|---|
| `explorer` | Read, Grep, Glob | Read-only repo navigation. Returns precise file paths and line numbers, never raw content dumps. |
| `critic` | Read, Grep, Bash | Code review. Surfaces correctness, scope creep, security, and architecture issues. Ignores style (Ruff handles that). |
| `test-writer` | Read, Grep, Edit, Write | Writes failing pytest tests. Never writes implementation code. |
Subagents run in isolated context — their verbose work doesn't pollute the main conversation.
Five slash commands in .claude/commands/:
| Command | What it does |
|---|---|
| `/plan <idea>` | Runs a PRD interview (asks structured questions), writes to docs/plans/<slug>.md |
| `/build <slice>` | TDD loop: test-writer writes failing tests, main agent implements, hooks enforce quality |
| `/check` | Delegates to the critic subagent for code review of the current diff |
| `/score` | Scores the current diff against evals/rubric.md (10-item behavioral checklist) |
| `/clear-and-go <slug>` | Resets context window, rehydrates plan state, shows progress summary |
Twelve skills in .claude/skills/ that teach Claude your stack's patterns. Eleven auto-load; eval-task is loaded manually:
Workflow:
- `prd-interview`: structured questions to turn ideas into PRDs
- `vertical-slice`: cuts PRDs into independently shippable units

API + Tests:
- `fastapi-patterns`: thin API layer wrapping LangGraph graphs
- `pytest-tdd`: async testing with `MemorySaver`, `fake_llm` fixture, deterministic harness

Agent layer:
- `langgraph-patterns`: state schemas, nodes, edges, checkpointers, subgraphs
- `langchain-utilities`: chat models, output parsers, model routing, tool decorators
- `langsmith`: production tracing and evals

Infrastructure:
- `gcp-runtime`: Cloud Run, Secret Manager, Cloud Build CI
- `gcp-state-and-events`: Firestore checkpointer, Pub/Sub publishers/subscribers
- `docker-compose`: local dev with Pub/Sub + Firestore emulators

Context:
- `context-management`: proactive context hygiene (always loaded, see below)

Manual:
- `eval-task`: describes the local eval format
Hone enforces a structured development loop:
/plan <idea>
│
▼
PRD interview (structured questions)
Writes docs/plans/<slug>.md
│
▼
Vertical slicing (3-7 shippable slices)
Appended to plan file
│
▼
/build <slice>
│
▼
test-writer writes failing tests
Main agent implements minimum code
Hooks auto-format + typecheck
Stop hook blocks if tests are red
│
▼
/check
│
▼
Critic reviews diff
(correctness, scope, security, architecture)
│
▼
/score
│
▼
Rubric scoring (10 items, 8/10 = green)
│
▼
Commit → /clear-and-go → next slice
Each slice is independently shippable and reversible. One slice per branch. Rebase before merging.
Long Claude Code sessions (2+ hours) degrade past ~100k tokens — the model starts hallucinating, forgetting decisions, and re-exploring files it already read. Hone manages this proactively through five mechanisms:
The inject-context.sh hook exports CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50, triggering auto-compact at ~100k tokens instead of the default ~190k. This fires early enough to prevent quality degradation.
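The threshold arithmetic can be checked in a few lines. This is a sketch under one assumption the text does not state: that the percentage applies to a context window of roughly 200k tokens. `CONTEXT_WINDOW_TOKENS` and `compact_threshold` are illustrative names, not part of Hone or Claude Code.

```python
# Assumption: a ~200k-token context window. The text above only gives the
# resulting thresholds (~100k at 50%, ~190k by default).
CONTEXT_WINDOW_TOKENS = 200_000


def compact_threshold(percent: int, window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Return the token count at which auto-compact would fire."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return window * percent // 100
```

Under that assumption, `compact_threshold(50)` gives the ~100k figure, and a 95% setting would line up with the ~190k default.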
The ## Compact Instructions section in CLAUDE.md tells the compactor what to preserve:
- Current slice and acceptance criteria
- All decisions made and their reasoning
- File paths modified
- Test state
- User preferences and corrections
And what to discard:
- Raw file contents (files still exist on disk)
- Verbose tool output (test logs, git diffs, grep results)
- Intermediate reasoning
At natural phase transitions (after exploration, after tests pass, after completing a slice), Claude writes a state snapshot to .claude/checkpoints/current.md:
# Checkpoint — 2026-05-05T14:32Z
## Active slice
classify-001: Implement classifier agent
## Decisions made
- Using Pydantic v2 state for structured validation
- model_router routes classify to gemini-2.5-flash
## Files modified
- app/agents/classifier/state.py (created)
- app/agents/classifier/nodes.py (created)
## Test state
- 3 tests written, 2 passing, 1 failing
## What's next
- Fix the failing test: edge function not wired up

This file is injected into every prompt by inject-context.sh. After a compact, Claude immediately has the full state snapshot. Checkpoint files are gitignored: they're session scratchpads, not committed artifacts.
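For readers who want to generate or validate checkpoints from their own tooling, here is a minimal sketch that renders the same section layout as the sample above. In Hone, Claude writes this file itself; `render_checkpoint` and its parameters are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from datetime import datetime, timezone


def render_checkpoint(
    active_slice: str,
    decisions: list[str],
    files_modified: list[str],
    test_state: list[str],
    whats_next: list[str],
    now: datetime | None = None,
) -> str:
    """Return markdown in the same section layout as the sample checkpoint."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y-%m-%dT%H:%MZ")

    def bullets(items: list[str]) -> list[str]:
        return [f"- {item}" for item in items]

    lines = [
        f"# Checkpoint \u2014 {stamp}",  # \u2014 is the dash used in the sample heading
        "## Active slice",
        active_slice,
        "## Decisions made",
        *bullets(decisions),
        "## Files modified",
        *bullets(files_modified),
        "## Test state",
        *bullets(test_state),
        "## What's next",
        *bullets(whats_next),
    ]
    return "\n".join(lines) + "\n"
```

The returned string could be written to `.claude/checkpoints/current.md` with `pathlib.Path.write_text`. The function itself does no I/O, which keeps it easy to test.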
The context-management skill teaches Claude to route noisy work through subagents:
- Use `explorer` to map out code, then act on the file paths it returns
- Use subagents for test output analysis, multi-file searches, log review
- Never read 5+ files into main context to "understand the codebase"
Instead of reading entire files, Claude is taught to:
- `Grep` for the function/class name to get the line number
- `Read` with `offset` and `limit` to get just the 20-50 lines needed
- Act on what was read
Files under ~100 lines can be read in full — the cost is negligible.
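The grep-then-read pattern can be mirrored in plain Python. This is a sketch of the idea only, not how Claude Code's Grep and Read tools are implemented; `find_definition` and `read_window` are hypothetical helpers.

```python
from __future__ import annotations

import itertools
import re
from pathlib import Path


def find_definition(path: Path, name: str) -> int | None:
    """Return the 1-indexed line where a def/class called `name` starts."""
    pattern = re.compile(rf"^\s*(?:async\s+def|def|class)\s+{re.escape(name)}\b")
    with path.open(encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if pattern.match(line):
                return lineno
    return None


def read_window(path: Path, offset: int, limit: int = 40) -> list[str]:
    """Return at most `limit` lines starting at 1-indexed line `offset`."""
    if offset < 1:
        raise ValueError("offset is 1-indexed")
    with path.open(encoding="utf-8") as handle:
        # islice stops reading once the window is filled, so the rest of
        # the file never enters memory.
        return list(itertools.islice(handle, offset - 1, offset - 1 + limit))
```

Calling `find_definition` first and passing its result to `read_window` keeps the amount of text pulled in proportional to the window size rather than the file size.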
block-secrets.sh blocks access to files matching secret patterns: .env, .pem, .key, credentials.*, secrets.*, *_secret.*, .p12, id_rsa.*. It exits with code 2 to block the tool.
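The shipped hook is a shell script. The Python sketch below only illustrates the matching logic, under my own reading of the listed patterns as globs on the file's base name (for example, `.pem` read as `*.pem`, and `id_rsa` matched with or without a suffix). `SECRET_GLOBS`, `is_secret_path`, and `exit_code_for` are hypothetical names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumption: each listed pattern is treated as a glob on the base name.
SECRET_GLOBS = (
    ".env", ".env.*", "*.pem", "*.key", "credentials.*", "secrets.*",
    "*_secret.*", "*.p12", "id_rsa", "id_rsa.*",
)

BLOCK_EXIT_CODE = 2  # the exit code the text says blocks the tool
ALLOW_EXIT_CODE = 0


def is_secret_path(file_path: str) -> bool:
    """Return True when the base name matches any secret glob."""
    name = PurePosixPath(file_path).name
    return any(fnmatchcase(name, glob) for glob in SECRET_GLOBS)


def exit_code_for(file_path: str) -> int:
    """Map a path to the exit code a blocking hook would return."""
    return BLOCK_EXIT_CODE if is_secret_path(file_path) else ALLOW_EXIT_CODE
```

Matching on the base name means `config/.env` is caught wherever it lives, while `app/main.py` passes through.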
block-protected.sh blocks edits to protected paths: migrations/, .git/, node_modules/, .venv/, dist/, build/. It directs Claude to the proper tools (e.g., alembic for migrations).
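A protected-path check can be sketched as an exact match on directory components. This is an illustration in Python, not the shipped shell script; `PROTECTED_DIRS` and `protected_component` are hypothetical names.

```python
from __future__ import annotations

from pathlib import PurePosixPath

PROTECTED_DIRS = frozenset(
    {"migrations", ".git", "node_modules", ".venv", "dist", "build"}
)


def protected_component(file_path: str) -> str | None:
    """Return the first protected directory in the path, or None."""
    # Only directory components are checked (the final part is the file name).
    for part in PurePosixPath(file_path).parts[:-1]:
        if part in PROTECTED_DIRS:
            return part
    return None
```

Exact component matching avoids false positives such as `.github/`, which merely starts with `.git`.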
format-on-write.sh runs ruff format and ruff check --fix on edited Python files. It runs asynchronously so it doesn't block Claude's turn.
typecheck-on-write.sh runs mypy --strict on edited Python files (skips tests/). It is non-blocking: it surfaces errors for Claude to fix but doesn't halt progress.
When Claude tries to end its turn, tests-must-pass.sh:
- Derives a pytest `-k` pattern from changed files via git
- Runs `pytest -k <pattern> --maxfail=3`
- Blocks the turn (exit 2) if tests are failing
- Allows the turn to end if tests pass or nothing changed
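The text does not say how the `-k` pattern is derived, so the following is one plausible derivation rather than the hook's actual logic: take the stem of each changed Python file, drop a `test_` prefix, and join the stems with `or`. `k_pattern` and `pytest_command` are hypothetical names.

```python
from __future__ import annotations

from pathlib import PurePosixPath

SKIPPED_STEMS = {"__init__", "conftest"}


def k_pattern(changed_files: list[str]) -> str:
    """Build a pytest -k expression from changed Python files."""
    keywords: list[str] = []
    for changed in changed_files:
        path = PurePosixPath(changed)
        if path.suffix != ".py":
            continue
        stem = path.stem.removeprefix("test_")
        if stem and stem not in SKIPPED_STEMS and stem not in keywords:
            keywords.append(stem)
    return " or ".join(keywords)


def pytest_command(changed_files: list[str]) -> list[str] | None:
    """Return the pytest argv, or None when nothing relevant changed."""
    pattern = k_pattern(changed_files)
    if not pattern:
        return None  # nothing to run, so the turn may end
    return ["pytest", "-k", pattern, "--maxfail=3"]
```

In a real hook the file list would come from git (for instance `git diff --name-only`), and a non-zero pytest exit would translate into exit code 2.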
inject-context.sh runs on every user prompt. It does three things:
- Exports `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50` for early auto-compaction
- Injects git context (branch, recent commits, active plan file)
- Injects checkpoint state from `.claude/checkpoints/current.md` if it exists
The permission denylist is configured in .claude/settings.json:
- `rm -rf` (all variants)
- `git push --force` and `git push -f`
- `git reset --hard`
explorer is a read-only repo navigator. Use it when the question is "where is X" or "find all Y in this codebase."
- Tools: Read, Grep, Glob
- Returns file paths and line numbers, not raw content
- Leads with a one-sentence answer, then references
- Never proposes changes or reads files unrelated to the question
critic is a code reviewer, invoked by /check or when asking for a diff review.
- Tools: Read, Grep, Bash
- Reads `git diff main...HEAD` and affected files in full
- Surfaces issues in four categories: Correctness > Scope creep > Security > Architecture violations
- Does NOT flag formatting (Ruff), types (mypy), or style preferences
- If nothing is wrong, says so in one line
test-writer is a pytest test writer, invoked by /build or when asked to add tests.
- Tools: Read, Grep, Edit, Write
- Produces a test file that fails initially and would pass after correct implementation
- Uses `httpx.AsyncClient` for endpoints, `MemorySaver` for graphs, `fake_llm` for LLM stubs
- Every test asserts on specific behavior, not just status codes
- Never writes implementation code
/plan runs the prd-interview skill. It asks 15-30 structured questions (one at a time) covering: who triggers this, input/output shapes, edge cases, error handling, success criteria, scope. It writes the PRD to docs/plans/<slug>.md and does not propose vertical slices or write code.
/build implements a vertical slice using strict TDD:
- Reads `docs/plans/<slug>.md`, identifies the target slice
- Delegates to `test-writer` for failing tests
- Runs tests to confirm failure
- Implements minimum code to pass
- Hooks auto-format and typecheck
- Stop hook blocks if tests are red
- Summarizes when green
/check delegates to the critic subagent. It passes through findings grouped by category (correctness, scope, security, architecture) or reports "nothing wrong."
/score reads evals/rubric.md and scores the current diff against 10 items:
- Process discipline: tests before implementation, stayed in scope, no hook bypass
- Output quality: no over-engineering, architecture respected, Pydantic at boundaries
- Verification gates: pytest passes, ruff passes, mypy passes, coverage rule
Each item is pass/fail/partial. 8/10 or better is a green run. Report is saved to evals/runs/manual-<timestamp>.md.
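The green/red decision can be sketched as a small tally. The text does not say how a "partial" counts toward the 8/10 bar, so this sketch assumes only full passes count; treat that as an assumption, not Hone's rule. `summarize` and `GREEN_THRESHOLD` are hypothetical names.

```python
from collections import Counter

GREEN_THRESHOLD = 8  # "8/10 or better is a green run"
VALID_OUTCOMES = {"pass", "fail", "partial"}


def summarize(results: dict[str, str]) -> dict[str, object]:
    """Tally rubric outcomes and decide whether the run is green."""
    unknown = set(results.values()) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    counts = Counter(results.values())
    passed = counts["pass"]
    return {
        "passed": passed,
        "partial": counts["partial"],
        "failed": counts["fail"],
        "total": len(results),
        # Assumption: a partial does not count toward the threshold.
        "green": passed >= GREEN_THRESHOLD,
    }
```

If partials should earn half credit in your setup, the `green` line is the only place that would need to change.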
/clear-and-go resets context between slices:
- Reads `docs/plans/<slug>.md` and current state of referenced files
- Shows a one-paragraph summary: which slices are done, which is next, test status
- Waits for instruction and does not start coding
Skills auto-load based on their description matching the current task context. They teach Claude stack-specific patterns without cluttering CLAUDE.md.
prd-interview — Turns ideas into PRDs through structured questions. Covers triggers, input/output shapes, edge cases, error handling, success criteria, and scope. Writes to docs/plans/<slug>.md.
vertical-slice — Cuts PRDs into 3-7 independently shippable units. Each slice touches every needed layer, has working tests, can ship to main without breaking anything, and is reversible.
fastapi-patterns — Thin API layer patterns. Routes validate input, invoke the compiled graph, return the response. Covers synchronous, async (Pub/Sub + 202), and streaming (SSE) patterns. No business logic in routes.
pytest-tdd — Testing patterns for LangGraph agents and FastAPI endpoints. conftest.py layout with client, graph, and fake_llm fixtures. Tests nodes (unit), graphs (integration), edges (pure functions), API layer, and async runs.
langgraph-patterns — State schemas (TypedDict or Pydantic v2 with Annotated[..., add_messages]), async @traceable nodes, pure edge functions, graph building, checkpointers, tool calling, subgraphs.
langchain-utilities — Chat models wrapped in functions for easy swapping, centralized model routing (model_for_task()), structured output via with_structured_output(), ChatPromptTemplate, @tool decorator.
langsmith — Production tracing with @traceable on every node, run metadata for filtering, LLM-as-judge evals on sampled runs, dataset building from production traces.
gcp-runtime — Cloud Run (two-stage Dockerfile, exec uvicorn), Secret Manager (Pydantic Settings), Cloud Build CI (pytest/ruff/mypy + deploy), IAM (one SA per service, least privilege).
gcp-state-and-events — Firestore as LangGraph checkpointer (custom BaseCheckpointSaver), Pub/Sub for async agent triggers (publish/push pattern), idempotency via status checks, emulators for local dev.
docker-compose — Local dev stack with Firestore and Pub/Sub emulators. Same Docker image for dev and prod (only env vars differ). Test override skips emulators (tests use MemorySaver).
context-management — Always loaded. Teaches five behaviors: route noisy work through subagents, read targeted sections not full files, write checkpoints at phase transitions, prefer targeted tool calls, summarize before continuing. See Context Management.
evals/rubric.md is a 10-item behavioral checklist applied to every golden task:
- Tests written before implementation (verifiable from git log)
- Stayed within slice scope
- No hook bypass
- No over-engineering
- Architecture respected (routes invoke graphs, not nodes directly)
- Pydantic models at boundaries
- `pytest` passes
- `ruff check .` passes
- `mypy --strict app/` passes
- Coverage rule (each test asserts specific behavior)
evals/run.py runs Claude Code in headless mode against golden tasks:
python evals/run.py # all tasks
python evals/run.py --task <name> # one task
python evals/run.py --only-failing # re-run failed from last report

Process:
- Runs setup commands from the task file
- Invokes `claude -p <prompt> --dangerously-skip-permissions`
- Captures git diff stats
- Runs acceptance commands (pytest, ruff, mypy)
- Scores rubric (auto-checks what it can, marks rest as "manual")
- Writes JSON report to `evals/runs/`
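The acceptance step in the process above can be sketched as follows. This is not the code in `evals/run.py`, which is not shown here. `run_acceptance` and `all_passed` are hypothetical names, and recording 127 for a missing executable is my own convention borrowed from shells.

```python
from __future__ import annotations

import shlex
import subprocess

COMMAND_NOT_FOUND = 127  # assumption: mirror the shell's "not found" code


def run_acceptance(commands: list[str], cwd: str = ".") -> dict[str, int]:
    """Run each acceptance command and record its exit code."""
    results: dict[str, int] = {}
    for command in commands:
        try:
            completed = subprocess.run(
                shlex.split(command), cwd=cwd, capture_output=True, text=True
            )
            results[command] = completed.returncode
        except FileNotFoundError:
            results[command] = COMMAND_NOT_FOUND
    return results


def all_passed(results: dict[str, int]) -> bool:
    """Acceptance passes only when every command exited 0."""
    return all(code == 0 for code in results.values())
```

Capturing output instead of streaming it keeps the runner quiet, and the captured text could be attached to the JSON report when a command fails.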
Each task is a markdown file with four sections:
## Setup
Commands to run before Claude (e.g., git checkout, uv sync)
## Prompt
Exact prompt to send Claude in headless mode
## Acceptance
Commands that must exit 0 (e.g., pytest, ruff, mypy)
## Rubric overrides
Additional behavioral checks beyond the default rubric

Two example tasks are included:
- `example-classify-agent.md`: Build a minimal LangGraph agent with TDD
- `example-create-user.md`: Build an async endpoint with Pub/Sub
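A task file in the four-section format described above can be parsed with a short helper. This is a sketch, not the parser used by `evals/run.py`; `parse_task` and `SECTION_NAMES` are hypothetical names.

```python
from __future__ import annotations

import re

SECTION_NAMES = ("Setup", "Prompt", "Acceptance", "Rubric overrides")
HEADING = re.compile(r"^##\s+(.+?)\s*$")  # matches level-2 headings only


def parse_task(markdown: str) -> dict[str, str]:
    """Split a task file into its four named sections."""
    sections: dict[str, list[str]] = {}
    current: str | None = None
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    missing = [name for name in SECTION_NAMES if name not in sections]
    if missing:
        raise ValueError(f"task file is missing sections: {missing}")
    return {name: "\n".join(sections[name]).strip() for name in SECTION_NAMES}
```

Failing loudly on a missing section catches a malformed task before any time is spent on a headless Claude run.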
The GitHub Actions workflow (.github/workflows/eval.yml) runs the eval suite on PRs that touch CLAUDE.md, .claude/, evals/, or docs/conventions.md. It uploads the report as a build artifact.
mkdir -p .claude/skills/my-skill

Create .claude/skills/my-skill/SKILL.md:
---
name: my-skill
description: Auto-loads when ... (be specific or it loads too often)
---
# My Skill
Patterns and rules here.

The description field is the auto-load trigger. Keep it narrow: a skill that loads on every task wastes context.
To add or change a hook, edit .claude/settings.json. Available events: PreToolUse, PostToolUse, Stop, SubagentStop, UserPromptSubmit. Each hook receives JSON on stdin describing the event, such as the tool call.
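A custom hook does not have to be a shell script; any executable that reads the JSON payload and sets an exit code will do. The sketch below assumes the payload carries `tool_name` and `tool_input.file_path` fields, so verify those names against the payload your Claude Code version sends. The lockfile rule and the names `parse_hook_event`, `decide`, and `main` are hypothetical.

```python
from __future__ import annotations

import json
import sys


def parse_hook_event(raw: str) -> tuple[str, str | None]:
    """Extract the tool name and target file path from the hook payload."""
    event = json.loads(raw)
    tool_input = event.get("tool_input") or {}
    return event.get("tool_name", ""), tool_input.get("file_path")


def decide(raw: str) -> int:
    """Return 2 to block the tool call, 0 to allow it."""
    tool_name, file_path = parse_hook_event(raw)
    # Hypothetical rule: never let Claude hand-edit lockfiles.
    if tool_name in {"Edit", "Write"} and file_path and file_path.endswith(".lock"):
        print(f"Blocked edit to {file_path}", file=sys.stderr)
        return 2
    return 0


def main() -> int:
    """Entry point for a hook script: read stdin, return the exit code."""
    return decide(sys.stdin.read())
```

A script file would finish with `sys.exit(main())`. Keeping the decision in `decide` lets the rule be unit-tested without piping anything through stdin.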
cp evals/tasks/example-create-user.md evals/tasks/your-task.md
# Edit the Setup, Prompt, Acceptance, and Rubric overrides sections
python evals/run.py --task your-task

If you're not using the full AIMANA stack, remove irrelevant skills:
# Not using GCP? Remove these:
rm -rf .claude/skills/gcp-runtime
rm -rf .claude/skills/gcp-state-and-events
rm -rf .claude/skills/docker-compose
# Not using LangGraph? Remove these:
rm -rf .claude/skills/langgraph-patterns
rm -rf .claude/skills/langchain-utilities
rm -rf .claude/skills/langsmith

The harness primitives (CLAUDE.md, hooks, subagents, commands, context management, evals) are stack-agnostic. Only the skills are stack-specific.
Don't install everything on day one. Order matters.
Day 1. Drop in CLAUDE.md and the four core hooks (block-secrets, block-protected, format-on-write, tests-must-pass). Use the harness on real work for a few days. See what hurts.
Day 2. Add inject-context.sh and the context management skill. Add the three subagents and the slash commands.
Day 3. Add the skills relevant to your stack. Don't blanket-install — import only the ones that match work you actually do.
Day 4. Build the eval harness. Write 3-5 golden tasks based on real work. Set up the GitHub Action.
Beyond. Iterate based on what evals tell you. Add project-specific skills as patterns emerge. Adjust the compact threshold if 50% is too aggressive or too lax.
.
├── CLAUDE.md # Root instructions (~60 lines)
├── install.sh # curl | bash installer
├── npm/ # npx create-hone package source
├── .gitignore # Ignores checkpoints, logs, caches, eval runs
├── docs/
│ ├── architecture.md # Three-layer architecture: API → Agent → Tools
│ ├── conventions.md # Naming, imports, types, errors, git, tests
│ └── plans/ # Plan files for in-flight features
├── .claude/
│ ├── settings.json # Hook config + permission denylist
│ ├── checkpoints/ # Session checkpoint files (gitignored)
│ ├── agents/
│ │ ├── explorer.md # Read-only repo navigator
│ │ ├── critic.md # Code reviewer
│ │ └── test-writer.md # Pytest test writer
│ ├── commands/
│ │ ├── plan.md # /plan — PRD interview
│ │ ├── build.md # /build — TDD implementation
│ │ ├── check.md # /check — code review
│ │ ├── score.md # /score — rubric scoring
│ │ └── clear-and-go.md # /clear-and-go — context reset
│ ├── hooks/
│ │ ├── block-secrets.sh # Blocks secret file access
│ │ ├── block-protected.sh # Blocks protected path edits
│ │ ├── format-on-write.sh # Auto-formats with ruff
│ │ ├── typecheck-on-write.sh # Auto-typechecks with mypy
│ │ ├── tests-must-pass.sh # Blocks turn if tests fail
│ │ └── inject-context.sh # Injects git + checkpoint context
│ ├── skills/
│ │ ├── context-management/ # Proactive context hygiene (always loaded)
│ │ ├── prd-interview/ # Structured PRD questions
│ │ ├── vertical-slice/ # Slice PRDs into shippable units
│ │ ├── fastapi-patterns/ # Thin API layer patterns
│ │ ├── pytest-tdd/ # Test patterns and fixtures
│ │ ├── langgraph-patterns/ # Agent design patterns
│ │ ├── langchain-utilities/ # Chat models, routing, tools
│ │ ├── langsmith/ # Production tracing and evals
│ │ ├── gcp-runtime/ # Cloud Run, Secret Manager, CI
│ │ ├── gcp-state-and-events/ # Firestore checkpointer, Pub/Sub
│ │ ├── docker-compose/ # Local dev with emulators
│ │ └── eval-task/ # Local eval format (manual)
│ └── logs/ # Hook logs (gitignored)
├── evals/
│ ├── rubric.md # 10-item behavioral checklist
│ ├── run.py # Headless eval runner
│ ├── tasks/ # Golden task files
│ └── runs/ # Eval reports (gitignored)
└── .github/workflows/
└── eval.yml # CI: runs evals on harness PRs
Built at AIMANA. Maintained as needed, not as theater.