English | 中文
A design pattern for Claude Code Skills that improve through use — growing more accurate and efficient over time, without bloating.
Note
Academic positioning: This pattern corresponds to Inter-test-time Context Evolution with Text-Feedback Governance in the self-evolving agent literature. See Gao et al. (2026) "A Survey of Self-Evolving Agents."
Traditional Skills are static — an author packages them once, users invoke them repeatedly, and knowledge never grows.
But in domains like database investigation, codebase analysis, and business system integration, an AI continuously discovers valuable domain knowledge during use — table relationships, query patterns, business rules, data characteristics. Without a way to persist this knowledge, every new session starts from zero, wasting both effort and context window.
Is this pattern right for your use case? Ask two questions:
- Will domain knowledge grow through use?
- Does that growth have a natural ceiling?
If both answers are yes, this pattern fits.
skill-name/
├── SKILL.md # Trigger conditions + governance protocol
├── scripts/ # Execution tools
│ ├── core/ # Computation layer (decay model)
│ │ ├── formulas.py # Atomic formulas
│ │ ├── models.py # Composite models + config
│ │ └── parser.py # Decay tag parser
│ ├── decay_engine.py # CLI: init / scan / feedback / reset / inject / search
│ └── *.py # Domain-specific tools
└── references/ # Living knowledge base (AI-maintained)
├── _index.md # Routing table (<40 lines)
└── <topic>.md # Topic files with decay-tagged entries
The reference implementation is a Self-Evolving Skill for MySQL database investigation. Install it to see the pattern in action on your own database.
- An AI coding agent (Claude Code, Cursor, Windsurf, Codex, etc.)
- Node.js ≥ 18 and Python ≥ 3.8
pip install pymysql
macOS / Linux:
npx skills add 191341025/Self-Evolving-Skill --skill db-investigator

Windows:
npx skills add 191341025/Self-Evolving-Skill --skill db-investigator --copy -y
`--copy` bypasses Windows symlink permission issues; `-y` skips interactive agent selection.
Run setup.py from the installed skill directory:
# Find your agent's skill path (one of these will exist):
# .claude/skills/ .cursor/skills/ .windsurf/skills/ .continue/skills/
python <agent>/skills/db-investigator/scripts/setup.py

The interactive wizard collects your MySQL connection details, tests the connection, and initializes the knowledge system.
Tip: Or just start a conversation and ask a database question — if unconfigured, the skill will tell you exactly what to run.
Start a Claude Code conversation and ask any database question. The skill activates automatically and begins evolving its domain knowledge through use.
This is the core of the pattern. It prevents the knowledge base from degrading into noise.
Gate 1 — VALUE
Q: Can this knowledge be reused across sessions?
→ One-time result (e.g., "query returned 42 rows at 3pm") → REJECT
→ Reusable pattern or stable fact → PASS
Gate 2 — ALIGNMENT
Q: Does this contradict existing knowledge?
→ Contradiction found → CORRECT the existing entry (don't append)
→ Consistent → PASS
Gate 3 — REDUNDANCY
Q: Does this already exist, possibly worded differently?
→ Exists → MERGE into existing entry, or skip
→ Doesn't exist → PASS
Gate 4 — FRESHNESS (write)
Classify knowledge type and attach decay metadata:
→ <!-- decay: type=<type> confirmed=<YYYY-MM-DD> C0=1.0 -->
→ Six types: schema | business_rule | tool_experience | query_pattern | data_range | data_snapshot
→ High-decay types (data_range, data_snapshot): prefer rejection
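The decay tag written at Gate 4 is machine-readable HTML-comment metadata. A minimal parser sketch — the regex below is an illustrative assumption, not the actual code in scripts/core/parser.py:

```python
import re
from datetime import date

# Assumed tag shape: <!-- decay: type=<type> confirmed=<YYYY-MM-DD> C0=<float> -->
TAG_RE = re.compile(
    r"<!--\s*decay:\s*type=(?P<type>\w+)"
    r"\s+confirmed=(?P<confirmed>\d{4}-\d{2}-\d{2})"
    r"\s+C0=(?P<c0>[\d.]+)\s*-->"
)

def parse_decay_tag(line):
    """Extract decay metadata from a markdown line, or None if no tag present."""
    m = TAG_RE.search(line)
    if not m:
        return None
    return {
        "type": m.group("type"),
        "confirmed": date.fromisoformat(m.group("confirmed")),
        "c0": float(m.group("c0")),
    }
```

Because the tag lives in an HTML comment, it stays invisible when the topic file is rendered but remains trivially parseable by the decay engine.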
Gate 4 — FRESHNESS (read)
Run confidence scan before using knowledge:
→ Tool computes C(t) based on time elapsed and feedback history
→ TRUST (C>=0.8): use directly
→ VERIFY (0.5<=C<0.8): use but flag for verification
→ REVALIDATE (C<0.5): verify with tools first
Gate 4 — FRESHNESS (feedback)
After operations that used knowledge:
→ Success → record positive feedback (slows future decay)
→ Failure → record negative feedback (accelerates decay)
→ After revalidation passes → reset to fresh state
Gate 5 — PLACEMENT
Q: Which file does this belong in? Which memory layer?
→ Existing topic → Add to that file
→ New topic → Only create a new file if 3+ related entries exist; update _index.md
The most common outcome of the Five Gates is: do nothing. Most interactions don't produce knowledge worth storing. The protocol's primary job is to reject, not to accept.
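The gate sequence can be sketched as a decision pipeline. This is an illustrative, self-contained toy — exact-match checks stand in for the LLM's semantic judgment at Gates 2 and 3, and the class names are assumptions, not the actual SKILL.md logic:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    reusable: bool          # Gate 1 input: reusable across sessions?
    decay_type: str = "schema"

@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)

    def contradicts(self, c):   # Gate 2 stand-in: naive negation check
        return any(e == "NOT " + c.text for e in self.entries)

    def duplicate(self, c):     # Gate 3 stand-in: exact-match check
        return c.text in self.entries

def govern(c, kb):
    if not c.reusable:                  # Gate 1 — VALUE
        return "REJECT"
    if kb.contradicts(c):               # Gate 2 — ALIGNMENT: fix in place
        return "CORRECT"
    if kb.duplicate(c):                 # Gate 3 — REDUNDANCY: merge, don't append
        return "MERGE"
    kb.entries.append(c.text)           # Gates 4+5 — tag with decay metadata, place by topic
    return "ADD"
```

Note that three of the four outcomes (REJECT, CORRECT, MERGE) leave the knowledge base no larger — only ADD grows it, which is the point.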
| Capability | Mechanism |
|---|---|
| Add knowledge | Must pass all five gates |
| Correct errors | Gate 2 detects contradictions; fix in place |
| Deduplicate | Gate 3 merges rather than appends |
| Expire stale data | Gate 4 confidence decay model; tool-computed freshness with Bayesian feedback |
| Maintain structure | Gate 5 + scaling rules control file granularity |
| Level | Loaded when | Content | Change frequency |
|---|---|---|---|
| Level 1: frontmatter | Always, in system prompt | "When to use this Skill, how to behave" | Rarely changes |
| Level 2: body | When Claude judges the task is relevant | "Which tools, how to govern knowledge" | Stable |
| Level 3: references/ | Claude navigates on demand | Living domain knowledge | Selective evolution |
| Level 3: scripts/ | Claude invokes on demand | Execution tools | Extend as needed |
The key distinction: A traditional Skill's Level 3 is static reference documentation. A Self-Evolving Skill's Level 3 is a living knowledge base, maintained by the AI under the governance protocol.
The confidence decay model's calculations (exponential decay, Bayesian feedback adjustment, threshold classification) are too complex for LLM "mental math" in prompts. The pattern separates concerns:
- SKILL.md (Prompt layer): Defines when to call tools and how to respond to results
- Python tools (Computation layer): Executes all math, returns clear conclusions
- LLM does not need to know the formulas — it runs the tool and acts on the output
Layer 3 architecture:
scripts/
├── core/
│ ├── formulas.py <- Atomic formulas (exponential decay, Bayesian factor)
│ ├── models.py <- Composite models (confidence calculation, classification)
│ └── parser.py <- Decay tag I/O (read/write markdown tags)
└── decay_engine.py <- CLI entry point (init / scan / search / feedback / reset / inject / invalidate)
This separation means formulas can grow in complexity without bloating SKILL.md, and every formula is unit-testable.
Tip
Interactive Decay Model Visualization — Explore the confidence formula C(t) = C₀ × e^(-λ × (β+1)/(α+1) × t) with adjustable parameters. See how knowledge type, positive/negative feedback, and time interact to produce TRUST / VERIFY / REVALIDATE decisions.
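The formula is straightforward to reproduce offline. A hedged sketch — the λ value below is an illustrative assumption; the real per-type rates live in scripts/core/models.py:

```python
import math

def confidence(c0, lam, days, alpha=0, beta=0):
    """C(t) = C0 * e^(-λ * (β+1)/(α+1) * t).

    alpha = positive feedback count (slows decay),
    beta  = negative feedback count (accelerates decay).
    """
    return c0 * math.exp(-lam * (beta + 1) / (alpha + 1) * days)

def classify(c):
    """Map confidence to the three-tier decision from Gate 4 (read)."""
    if c >= 0.8:
        return "TRUST"
    if c >= 0.5:
        return "VERIFY"
    return "REVALIDATE"
```

The (β+1)/(α+1) factor is what makes feedback Bayesian-flavored: each success stretches the effective half-life, each failure compresses it, and a revalidation reset restores C₀ = 1.0 with a fresh `confirmed` date.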
Knowledge is loaded by route, not in bulk:
Skill triggered
↓
Read _index.md (routing table, <40 lines, topic list + one-line summaries)
↓
Determine which topic files are relevant to the current conversation
↓
Load only 1–2 relevant topic files
↓
Irrelevant files are not loaded — context window preserved
Why not load everything? A Skill is a prompt injection layer — context window is a finite resource. When the knowledge base grows to 5 topic files at 50–80 lines each, bulk loading wastes 250–400 lines. The routing table keeps injection size under control regardless of how much the knowledge base grows.
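As an illustration of the routing step — in the real Skill, topic selection is the LLM's judgment, not keyword matching, and the _index.md line format here is an assumption — the flow looks roughly like:

```python
def route(index_text, query, max_files=2):
    """Pick the most relevant topic files from a routing table.

    Assumed _index.md line format: '- <topic>.md: one-line summary'.
    Scores each topic by crude keyword overlap with the query.
    """
    candidates = []
    for line in index_text.splitlines():
        if not line.startswith("- "):
            continue
        fname, _, summary = line[2:].partition(":")
        score = sum(1 for w in query.lower().split() if w in summary.lower())
        if score:
            candidates.append((score, fname.strip()))
    candidates.sort(reverse=True)
    return [f for _, f in candidates[:max_files]]
```

Only the files returned here are read into context; everything else in references/ costs nothing for this conversation.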
- Single topic file exceeds ~80 lines → split into sub-topics
- Total topic files exceed 8 → review for merge opportunities
- `_index.md` must stay under 40 lines (routing only, no detail)
Adapted from the hierarchical memory architecture described in Gao et al. (2026):
| Layer | What it stores | File | Example |
|---|---|---|---|
| Structural knowledge | Static relationships between tables, SPs, databases | schema_map.md | "orders.customer_id references customers.id" |
| Business rules | Stable rules confirmed through interaction | business_rules.md | "Skip if customer not found; don't create order" |
| Query patterns | Reusable single-step SQL templates | query_patterns.md | "Count orders grouped by status" |
| Investigation flows | Reusable multi-step investigation procedures | investigation_flows.md | "Debug duplicate orders: Step1 check customer → Step2 check dedup conditions → Step3 compare results" |
Knowledge typically distills upward: A successful investigation may first add to schema_map.md (new table relationship discovered), then query_patterns.md (effective query extracted), then investigation_flows.md (full procedure consolidated). But not every interaction completes the full chain — if you only learned a new field name, schema_map.md is enough.
The survey describes three stages of tool evolution: creation → mastery → management/selection. This pattern does not pursue full tool evolution (it's not appropriate for Skills to automatically create or modify scripts), but it does support lightweight accumulation of tool-use experience:
- Effective parameter combinations: The best parameter settings for a given query script in specific scenarios
- Boundary conditions: Known pitfalls and caveats (e.g., "connection timeout requires reconnect")
- Composition patterns: Effective sequences for chaining multiple tools together
These experiences are recorded as annotations within query_patterns.md or investigation_flows.md — no separate file is created.
The goal is not to make tools evolve autonomously, but to make the AI more efficient next time it uses a tool — like a skilled mechanic knowing which wrench fits which bolt.
1. Evolution is demand-driven, not self-initiated
Every change must be triggered by real user interaction — business needs pull knowledge accumulation, not the mechanism itself pushing changes. A Skill doesn't go looking for things to learn. It waits for real business problems to drive its growth.
This is a deliberate divergence from academic systems like Voyager, which autonomously explore. Those systems operate at the model parameter level. At the Skill (prompt injection) level, evolution signals should come from real business scenarios, not artificially generated exploration tasks.
2. Selective growth, not continuous growth
Not every interaction should change anything. "Admission is harder than deletion" — each of the five gates asks "do you really need to add this?" The survey literature calls this Selective Evolution vs. Continuous Evolution. A stable, high-quality rule beats ten vague notes.
3. Maturity means stability
A sufficiently mature Skill that stops growing is healthy — it's the intended end state, not a failure. When domain knowledge covers 90%+ of everyday scenarios, the Skill should converge. New growth is only triggered when the business itself changes (new tables, new rules, new processes).
This directly addresses the stability-plasticity dilemma described in the literature: excessive plasticity (constant change) causes catastrophic forgetting; timely stability is the mark of a mature system.
From Gao et al. (2026)'s four-dimensional classification:
| Dimension | Full spectrum (paper) | This pattern's position | Design rationale |
|---|---|---|---|
| What evolves | Model weights / context / tools / architecture | Context (memory + prompts) + lightweight tool experience | Skills cannot modify model weights or architecture; context is the only persistable evolution site |
| When to evolve | Intra-test-time / Inter-test-time | Inter-test-time (cross-session) | Knowledge persists in references/ and is reused across sessions |
| How to evolve | Reward-driven / imitation learning / population evolution | Text-feedback + Five-Gate governance | AI judgment as "reward signal"; Five Gates as governance constraint |
| Where to evolve | General / specialized domains | Specialized domains (e.g., database investigation) | Domains where knowledge grows through use |
Intentional non-goals:
- No model weight modification — not possible or necessary at the Skill layer
- No architecture search — Skill structure is stable; only knowledge content changes
- No autonomous exploration — evolution is driven by real interaction, not self-generated tasks
- No perpetual growth — convergence to stability is the goal, not a problem
The survey literature introduces Misevolution — quality degradation during self-evolution. In the Skill context:
| Risk | Manifestation | Protection |
|---|---|---|
| Knowledge pollution | Incorrect rules written in, causing downstream errors | Gate 2 (ALIGNMENT) + user confirmation for critical business rules |
| Information bloat | Low-value knowledge accumulates, burying important rules | Gate 1 (VALUE) + Gate 3 (REDUNDANCY) + scaling rules |
| Stale data | Deprecated table structures or SPs remain in knowledge base | Gate 4 (FRESHNESS) + confidence decay model + feedback-driven revalidation |
| Instruction conflict | SKILL.md instructions contradict references/ knowledge | SKILL.md changes rarely; references/ isolated via routing |
The fundamental safety advantage: A Self-Evolving Skill operates entirely at the context layer. It does not modify model parameters or system architecture. Even if the knowledge base develops an error, the worst case is "one bad reference fact" — not "irreversible behavioral change in the model." This is a natural safety advantage over model-level self-evolution.
No KPI-style metrics are needed. Watch for these qualitative signals:
| Signal | Meaning |
|---|---|
| Most database questions can be answered without consulting references/ | Knowledge has been internalized into conversational capability |
| Five Gates repeatedly return "do nothing" | Knowledge base covers common scenarios |
| _index.md topic list has stabilized | Domain structure has converged |
| New knowledge is mostly Gate 4 freshness updates, not new facts | Framework is stable; only data is refreshing |
Nascent: references/ nearly empty; most interactions produce new knowledge
↓
Growing: Core structure and rules established; new additions slowing
↓
Mature: Knowledge base stable; updates only on business changes
↓
Business change triggers: new systems, schema refactors → partial return to Growing
| | Traditional Skill | Self-Evolving Skill |
|---|---|---|
| Knowledge source | Author-defined at creation | Predefined + accumulated through use |
| Lifecycle | Fixed after creation | Selective evolution → convergence to maturity |
| Quality control | Manual author maintenance | Five-Gate protocol (self-governing) |
| Context efficiency | Full body injection | Routing table + selective injection |
| Cross-project reuse | Copy entire Skill | Copy Skill skeleton + clear references/; accumulate fresh in new project |
| Cross-session continuity | None | Knowledge persists in references/ |
| End state | Never changes | Stable (not changing = success) |
Use this pattern when:
- Database investigation — table relationships and query patterns accumulate through use
- Codebase analysis — architectural understanding deepens over time
- Business system integration — business rules get confirmed through conversation
- Any domain where knowledge grows through use and has a natural ceiling
Don't use this pattern when:
- Pure tool Skills (e.g., a PDF reader — no domain knowledge to accumulate)
- Fully predetermined knowledge Skills (e.g., coding style guides — rules don't change through use)
- Unbounded knowledge growth scenarios (these require model-level evolution, not Skill-level)
This repository includes a complete reference implementation under the Claude Code skill path: .claude/skills/db-investigator/
A Self-Evolving Skill for MySQL database investigation, demonstrating:
| Component | Path | Description |
|---|---|---|
| Skill definition | SKILL.md | frontmatter (triggers) + body (tool selection, Five-Gate protocol, scaling rules) |
| Computation layer | scripts/core/ | Confidence decay formulas, models, and tag parser |
| Decay engine | scripts/decay_engine.py | CLI: init, scan, search, feedback, reset, inject, invalidate (7 subcommands) |
| Execution tools | scripts/ | Read-only tools: data query, structure fetch, metadata index |
| Knowledge routing | references/_index.md | Routing table |
| Domain knowledge | references/*.md | Living knowledge entries with decay metadata tags |
| Structure cache | db_schemas/ | Offline cache for fetched structures |
Tip
Tool layer is swappable. The Self-Evolving Skill pattern is tool-layer agnostic. If you prefer MCP (Model Context Protocol) — e.g., a MySQL MCP Server — you can replace the Python scripts entirely. Just update allowed-tools in SKILL.md to point to your MCP tools instead. The core of the pattern (Five-Gate governance, references/ knowledge system, selective injection) is completely independent of how queries reach the database.
Choose what fits your setup: Python scripts for simplicity and portability, MCP for native Claude Code integration.
We ran full evolution experiments on real databases to validate the design pattern.
| Experiment | Domain | Rounds | Key Findings |
|---|---|---|---|
| #01 nan-platform v1 | Smart Building Mgmt (29 tables) | 5 | 63.6% rejection rate, increments converge +75→+1, 2 Gate 2 self-corrections |
| #01 nan-platform v2 | Same domain, Gate 4 validation | 5 | Confidence decay model verified, 25/25 tasks completed, Bayesian feedback validated |
| #03 nan-platform v3 | Same domain, Phase 5 validation | 5+1 | 6/6 verification points passed, entities+search enable precise Gate 2/3, hard/soft signals validated, R6 time decay confirmed (type-specific λ + Bayesian deceleration observable at t=1d) |
Each experiment includes full evolution logs, Five-Gate decision records, quality audits, and per-round knowledge snapshots — you can diff any two rounds to observe exactly how the Skill's knowledge grew.
1. Define domain boundaries
Ask: Will knowledge grow through use? Does growth have a ceiling?
If both yes → proceed.
2. Create directory structure and bootstrap
skill-name/
├── SKILL.md
├── scripts/
│ └── core/ # Computation layer (decay model)
└── references/
└── _index.md
Run `python decay_engine.py init` to scaffold references/, _index.md, and db_config.ini.
3. Write SKILL.md
- frontmatter: trigger conditions + behavioral guidelines
- body: tool selection + Five-Gate governance protocol + scaling rules
- Gate 4 freshness: delegate confidence math to scripts/core/ tools, not inline formulas
4. Initialize references/
- _index.md: empty routing table
- 0–2 initial topic files (if domain knowledge is already known)
5. Begin using
- AI discovers valuable knowledge through real interactions
- Knowledge passes through Five Gates before being written
- _index.md updated accordingly
6. Let it converge naturally
- Knowledge grows selectively; Five Gates prevent bloat
- Files split or merge when thresholds are reached
- Mature Skill stabilizes; updates only when the business changes
Note
v2 is in the design phase. The v1 reference implementation (db-investigator) above remains fully functional.
The pattern is evolving from a database-specific skill to a general-purpose knowledge lifecycle governance mechanism. Core changes:
Two-Phase Knowledge Lifecycle
- Phase 1 — Knowledge Admission (Distillation): Knowledge must prove its value through repeated use before being accepted. A cumulative `distill_score = Σ (Wc × Wr × e^(-λd × days))` tracks credibility-weighted, recency-decayed confirmations. Only when the score crosses a threshold does knowledge graduate to VALIDATED status.
- Phase 2 — Knowledge Freshness (Preservation): The existing decay model `C(t) = C₀ × e^(-λ × (β+1)/(α+1) × t)` with asymmetric feedback — success adds cautious credit (+0.3), failure adds aggressive penalty (+1.5). Five correct uses are needed to offset one failure.
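A minimal sketch of the Phase 1 admission score, assuming Wc and Wr are per-confirmation credibility and relevance weights in [0, 1] and the threshold value is hypothetical (the actual semantics are defined in the v2 design docs):

```python
import math

def distill_score(confirmations, lambda_d=0.05):
    """distill_score = Σ (Wc × Wr × e^(-λd × days)).

    confirmations: list of (wc, wr, days_ago) tuples — one per past
    confirmation, weighted by credibility, relevance, and recency.
    """
    return sum(wc * wr * math.exp(-lambda_d * days) for wc, wr, days in confirmations)

def is_validated(confirmations, threshold=2.0):
    # Hypothetical graduation threshold: CANDIDATE -> VALIDATED once crossed
    return distill_score(confirmations) >= threshold
```

Because each term decays with λd, a burst of old confirmations is worth less than recent repeated use — knowledge must keep earning its place to graduate.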
Vector Database Storage
- Moving from Markdown files + routing table to LanceDB (vector database) + BAAI/bge-small-zh-v1.5 (Chinese embedding model)
- Enables semantic search, vector-based deduplication, and CANDIDATE/VALIDATED partitioning
Domain-Agnostic Governance
- The governance layer (Five Gates, decay model, vector storage) is decoupled from any specific domain
- db-investigator becomes the first domain plugin; other domains can reuse the same governance infrastructure
Always-On Activation
- The skill triggers on every interaction — simultaneously retrieving relevant knowledge and observing whether new knowledge emerges
- Knowledge capture is natural and continuous, not a separate step
Design documents: research/design/knowledge-lifecycle-v2.md | research/design/implementation-plan-v2.md
- Gao, H., Geng, J., et al. (2026). "A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence." Transactions on Machine Learning Research. arXiv:2507.21046v4.
- This pattern primarily relates to Section 3.2 (Context Evolution: Memory + Prompt) and Section 4.2 (Inter-test-time Evolution) of the above survey.
- The Five-Gate protocol aligns with the memory management operations (ADD / MERGE / UPDATE / DELETE) described in Mem0, with a more systematic governance structure layered on top.
- The memory layer model is inspired by MUSE's hierarchical memory architecture (strategic / procedural / tool-use), adapted for the Skill context as a four-layer structure.
This pattern is derived from real-world use in telecom expense management (database investigation and billing audit scenarios). Feedback, case studies, and adaptations to other domains are welcome via Issues and PRs.
If you apply this pattern to a new domain, consider sharing:
- What domain you applied it to
- How long it took to reach the "Mature" stage
- Which of the Five Gates fired most frequently in your case
This project is licensed under CC BY-SA 4.0.
The Self-Evolving Skill design pattern — including the Five-Gate Governance Protocol, three-level progressive loading architecture, selective injection mechanism, memory layer model, and maturity stage framework — is an original creation by the author. The pattern was independently conceived and subsequently positioned against the self-evolving agent survey by Gao et al. (2026). That survey addresses self-evolution at the agent and model parameter level; this pattern's distinct contribution is applying self-evolution principles to the Skill layer (prompt injection layer), operating entirely within the context window without modifying model weights or system architecture.