Skip to content

191341025/Self-Evolving-Skill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Self-Evolving Skill Pattern

English | 中文

A design pattern for Claude Code Skills that improve through use — growing more accurate and efficient over time, without bloating.

License: CC BY-SA 4.0 Claude Code Paper

Note

Academic positioning: This pattern corresponds to Inter-test-time Context Evolution with Text-Feedback Governance in the self-evolving agent literature. See Gao et al. (2026) "A Survey of Self-Evolving Agents."


The Problem

Traditional Skills are static — an author packages them once, users invoke them repeatedly, and knowledge never grows.

But in domains like database investigation, codebase analysis, and business system integration, an AI continuously discovers valuable domain knowledge during use — table relationships, query patterns, business rules, data characteristics. Without a way to persist this knowledge, every new session starts from zero, wasting both effort and context window.


Quick Start

Is this pattern right for your use case? Ask two questions:

  1. Will domain knowledge grow through use?
  2. Does that growth have a natural ceiling?

If both answers are yes, this pattern fits.

skill-name/
├── SKILL.md                        # Trigger conditions + governance protocol
├── scripts/                        # Execution tools
│   ├── core/                       # Computation layer (decay model)
│   │   ├── formulas.py             # Atomic formulas
│   │   ├── models.py               # Composite models + config
│   │   └── parser.py               # Decay tag parser
│   ├── decay_engine.py             # CLI: init / scan / feedback / reset / inject / search
│   └── *.py                        # Domain-specific tools
└── references/                     # Living knowledge base (AI-maintained)
    ├── _index.md                   # Routing table (<40 lines)
    └── <topic>.md                  # Topic files with decay-tagged entries

Install the Reference Implementation

The reference implementation is a Self-Evolving Skill for MySQL database investigation. Install it to see the pattern in action on your own database.

Prerequisites

  • An AI coding agent (Claude Code, Cursor, Windsurf, Codex, etc.)
  • Node.js ≥ 18 and Python ≥ 3.8
  • pip install pymysql

Step 1 — Install

macOS / Linux:

npx skills add 191341025/Self-Evolving-Skill --skill db-investigator

Windows:

npx skills add 191341025/Self-Evolving-Skill --skill db-investigator --copy -y

--copy bypasses Windows symlink permission issues; -y skips interactive agent selection.

Step 2 — Configure

Run setup.py from the installed skill directory:

# Find your agent's skill path (one of these will exist):
#   .claude/skills/   .cursor/skills/   .windsurf/skills/   .continue/skills/
python <agent>/skills/db-investigator/scripts/setup.py

The interactive wizard collects your MySQL connection details, tests the connection, and initializes the knowledge system.

Tip: Or just start a conversation and ask a database question — if unconfigured, the skill will tell you exactly what to run.

Step 3 — Use

Start a Claude Code conversation and ask any database question. The skill activates automatically and begins evolving its domain knowledge through use.


The Five-Gate Governance Protocol

This is the core of the pattern. It prevents the knowledge base from degrading into noise.

Gate 1 — VALUE
  Q: Can this knowledge be reused across sessions?
  → One-time result (e.g., "query returned 42 rows at 3pm") → REJECT
  → Reusable pattern or stable fact → PASS

Gate 2 — ALIGNMENT
  Q: Does this contradict existing knowledge?
  → Contradiction found → CORRECT the existing entry (don't append)
  → Consistent → PASS

Gate 3 — REDUNDANCY
  Q: Does this already exist, possibly worded differently?
  → Exists → MERGE into existing entry, or skip
  → Doesn't exist → PASS

Gate 4 — FRESHNESS (write)
  Classify knowledge type and attach decay metadata:
  → <!-- decay: type=<type> confirmed=<YYYY-MM-DD> C0=1.0 -->
  → Six types: schema | business_rule | tool_experience | query_pattern | data_range | data_snapshot
  → High-decay types (data_range, data_snapshot): prefer rejection

Gate 4 — FRESHNESS (read)
  Run confidence scan before using knowledge:
  → Tool computes C(t) based on time elapsed and feedback history
  → TRUST (C>=0.8): use directly
  → VERIFY (0.5<=C<0.8): use but flag for verification
  → REVALIDATE (C<0.5): verify with tools first

Gate 4 — FRESHNESS (feedback)
  After operations that used knowledge:
  → Success → record positive feedback (slows future decay)
  → Failure → record negative feedback (accelerates decay)
  → After revalidation passes → reset to fresh state

Gate 5 — PLACEMENT
  Q: Which file does this belong in? Which memory layer?
  → Existing topic → Add to that file
  → New topic → Only create a new file if 3+ related entries exist; update _index.md

The most common outcome of the Five Gates is: do nothing. Most interactions don't produce knowledge worth storing. The protocol's primary job is to reject, not to accept.

Governance Capabilities

Capability Mechanism
Add knowledge Must pass all five gates
Correct errors Gate 2 detects contradictions; fix in place
Deduplicate Gate 3 merges rather than appends
Expire stale data Gate 4 confidence decay model; tool-computed freshness with Bayesian feedback
Maintain structure Gate 5 + scaling rules control file granularity

Architecture

Three-Level Loading

Level Loaded when Content Change frequency
Level 1: frontmatter Always, in system prompt "When to use this Skill, how to behave" Rarely changes
Level 2: body When Claude judges the task is relevant "Which tools, how to govern knowledge" Stable
Level 3: references/ Claude navigates on demand Living domain knowledge Selective evolution
Level 3: scripts/ Claude invokes on demand Execution tools Extend as needed

The key distinction: A traditional Skill's Level 3 is static reference documentation. A Self-Evolving Skill's Level 3 is a living knowledge base, maintained by the AI under the governance protocol.

Computation Layer: LLM Judges, Python Computes

The confidence decay model's calculations (exponential decay, Bayesian feedback adjustment, threshold classification) are too complex for LLM "mental math" in prompts. The pattern separates concerns:

  • SKILL.md (Prompt layer): Defines when to call tools and how to respond to results
  • Python tools (Computation layer): Executes all math, returns clear conclusions
  • LLM does not need to know the formulas — it runs the tool and acts on the output
Layer 3 architecture:
scripts/
├── core/
│   ├── formulas.py     <- Atomic formulas (exponential decay, Bayesian factor)
│   ├── models.py       <- Composite models (confidence calculation, classification)
│   └── parser.py       <- Decay tag I/O (read/write markdown tags)
└── decay_engine.py     <- CLI entry point (init / scan / search / feedback / reset / inject / invalidate)

This separation means formulas can grow in complexity without bloating SKILL.md, and every formula is unit-testable.

Tip

Interactive Decay Model Visualization — Explore the confidence formula C(t) = C₀ × e^(-λ × (β+1)/(α+1) × t) with adjustable parameters. See how knowledge type, positive/negative feedback, and time interact to produce TRUST / VERIFY / REVALIDATE decisions.

Selective Injection via Routing Table

Knowledge is loaded by route, not in bulk:

Skill triggered
    ↓
Read _index.md (routing table, <40 lines, topic list + one-line summaries)
    ↓
Determine which topic files are relevant to the current conversation
    ↓
Load only 1–2 relevant topic files
    ↓
Irrelevant files are not loaded — context window preserved

Why not load everything? A Skill is a prompt injection layer — context window is a finite resource. When the knowledge base grows to 5 topic files at 50–80 lines each, bulk loading wastes 250–400 lines. The routing table keeps injection size under control regardless of how much the knowledge base grows.

Scaling Rules

  • Single topic file exceeds ~80 lines → split into sub-topics
  • Total topic files exceed 8 → review for merge opportunities
  • _index.md must stay under 40 lines (routing only, no detail)

Memory Layer Model

Adapted from the hierarchical memory architecture described in Gao et al. (2026):

Layer What it stores File Example
Structural knowledge Static relationships between tables, SPs, databases schema_map.md "orders.customer_id references customers.id"
Business rules Stable rules confirmed through interaction business_rules.md "Skip if customer not found; don't create order"
Query patterns Reusable single-step SQL templates query_patterns.md "Count orders grouped by status"
Investigation flows Reusable multi-step investigation procedures investigation_flows.md "Debug duplicate orders: Step1 check customer → Step2 check dedup conditions → Step3 compare results"

Knowledge typically distills upward: A successful investigation may first add to schema_map.md (new table relationship discovered), then query_patterns.md (effective query extracted), then investigation_flows.md (full procedure consolidated). But not every interaction completes the full chain — if you only learned a new field name, schema_map.md is enough.


Tool Experience Accumulation

The survey describes three stages of tool evolution: creation → mastery → management/selection. This pattern does not pursue full tool evolution (it's not appropriate for Skills to automatically create or modify scripts), but it does support lightweight accumulation of tool-use experience:

  • Effective parameter combinations: The best parameter settings for a given query script in specific scenarios
  • Boundary conditions: Known pitfalls and caveats (e.g., "connection timeout requires reconnect")
  • Composition patterns: Effective sequences for chaining multiple tools together

These experiences are recorded as annotations within query_patterns.md or investigation_flows.md — no separate file is created.

The goal is not to make tools evolve autonomously, but to make the AI more efficient next time it uses a tool — like a skilled mechanic knowing which wrench fits which bolt.


Design Philosophy

Three Principles

1. Evolution is demand-driven, not self-initiated

Every change must be triggered by real user interaction — business needs pull knowledge accumulation, not the mechanism itself pushing changes. A Skill doesn't go looking for things to learn. It waits for real business problems to drive its growth.

This is a deliberate divergence from academic systems like Voyager, which autonomously explore. Those systems operate at the model parameter level. At the Skill (prompt injection) level, evolution signals should come from real business scenarios, not artificially generated exploration tasks.

2. Selective growth, not continuous growth

Not every interaction should change anything. "Admission is harder than deletion" — each of the five gates asks "do you really need to add this?" The survey literature calls this Selective Evolution vs. Continuous Evolution. A stable, high-quality rule beats ten vague notes.

3. Maturity means stability

A sufficiently mature Skill that stops growing is healthy — it's the intended end state, not a failure. When domain knowledge covers 90%+ of everyday scenarios, the Skill should converge. New growth is only triggered when the business itself changes (new tables, new rules, new processes).

This directly addresses the stability-plasticity dilemma described in the literature: excessive plasticity (constant change) causes catastrophic forgetting; timely stability is the mark of a mature system.

Positioning Against the Academic Framework

From Gao et al. (2026)'s four-dimensional classification:

Dimension Full spectrum (paper) This pattern's position Design rationale
What evolves Model weights / context / tools / architecture Context (memory + prompts) + lightweight tool experience Skills cannot modify model weights or architecture; context is the only persistable evolution site
When to evolve Intra-test-time / Inter-test-time Inter-test-time (cross-session) Knowledge persists in references/ and is reused across sessions
How to evolve Reward-driven / imitation learning / population evolution Text-feedback + Five-Gate governance AI judgment as "reward signal"; Five Gates as governance constraint
Where to evolve General / specialized domains Specialized domains (e.g., database investigation) Domains where knowledge grows through use

Intentional non-goals:

  • No model weight modification — not possible or necessary at the Skill layer
  • No architecture search — Skill structure is stable; only knowledge content changes
  • No autonomous exploration — evolution is driven by real interaction, not self-generated tasks
  • No perpetual growth — convergence to stability is the goal, not a problem

Misevolution Protection

The survey literature introduces Misevolution — quality degradation during self-evolution. In the Skill context:

Risk Manifestation Protection
Knowledge pollution Incorrect rules written in, causing downstream errors Gate 2 (ALIGNMENT) + user confirmation for critical business rules
Information bloat Low-value knowledge accumulates, burying important rules Gate 1 (VALUE) + Gate 3 (REDUNDANCY) + scaling rules
Stale data Deprecated table structures or SPs remain in knowledge base Gate 4 (FRESHNESS) + confidence decay model + feedback-driven revalidation
Instruction conflict SKILL.md instructions contradict references/ knowledge SKILL.md changes rarely; references/ isolated via routing

The fundamental safety advantage: A Self-Evolving Skill operates entirely at the context layer. It does not modify model parameters or system architecture. Even if the knowledge base develops an error, the worst case is "one bad reference fact" — not "irreversible behavioral change in the model." This is a natural safety advantage over model-level self-evolution.


Maturity Signals

No KPI-style metrics are needed. Watch for these qualitative signals:

Signal Meaning
Most database questions can be answered without consulting references/ Knowledge has been internalized into conversational capability
Five Gates repeatedly return "do nothing" Knowledge base covers common scenarios
_index.md topic list has stabilized Domain structure has converged
New knowledge is mostly Gate 4 freshness updates, not new facts Framework is stable; only data is refreshing

Maturity Stages

Nascent:  references/ nearly empty; most interactions produce new knowledge
    ↓
Growing:  Core structure and rules established; new additions slowing
    ↓
Mature:   Knowledge base stable; updates only on business changes
    ↓
Business change triggers: new systems, schema refactors → partial return to Growing

Comparison with Traditional Skills

Traditional Skill Self-Evolving Skill
Knowledge source Author-defined at creation Predefined + accumulated through use
Lifecycle Fixed after creation Selective evolution → convergence to maturity
Quality control Manual author maintenance Five-Gate protocol (self-governing)
Context efficiency Full body injection Routing table + selective injection
Cross-project reuse Copy entire Skill Copy Skill skeleton + clear references/; accumulate fresh in new project
Cross-session continuity None Knowledge persists in references/
End state Never changes Stable (not changing = success)

When to Use (and When Not To)

Use this pattern when:

  • Database investigation — table relationships and query patterns accumulate through use
  • Codebase analysis — architectural understanding deepens over time
  • Business system integration — business rules get confirmed through conversation
  • Any domain where knowledge grows through use and has a natural ceiling

Don't use this pattern when:

  • Pure tool Skills (e.g., a PDF reader — no domain knowledge to accumulate)
  • Fully predetermined knowledge Skills (e.g., coding style guides — rules don't change through use)
  • Unbounded knowledge growth scenarios (these require model-level evolution, not Skill-level)

Reference Implementation

This repository includes a complete reference implementation under the Claude Code skill path: .claude/skills/db-investigator/

A Self-Evolving Skill for MySQL database investigation, demonstrating:

Component Path Description
Skill definition SKILL.md frontmatter (triggers) + body (tool selection, Five-Gate protocol, scaling rules)
Computation layer scripts/core/ Confidence decay formulas, models, and tag parser
Decay engine scripts/decay_engine.py CLI: init, scan, search, feedback, reset, inject, invalidate (7 subcommands)
Execution tools scripts/ Read-only tools: data query, structure fetch, metadata index
Knowledge routing references/_index.md Routing table
Domain knowledge references/*.md Living knowledge entries with decay metadata tags
Structure cache db_schemas/ Offline cache for fetched structures

Tip

Tool layer is swappable. The Self-Evolving Skill pattern is tool-layer agnostic. If you prefer MCP (Model Context Protocol) — e.g., a MySQL MCP Server — you can replace the Python scripts entirely. Just update allowed-tools in SKILL.md to point to your MCP tools instead. The core of the pattern (Five-Gate governance, references/ knowledge system, selective injection) is completely independent of how queries reach the database.

Choose what fits your setup: Python scripts for simplicity and portability, MCP for native Claude Code integration.


Empirical Validation

We ran full evolution experiments on real databases to validate the design pattern.

View all experiment data

Experiment Domain Rounds Key Findings
#01 nan-platform v1 Smart Building Mgmt (29 tables) 5 63.6% rejection rate, increments converge +75→+1, 2 Gate 2 self-corrections
#01 nan-platform v2 Same domain, Gate 4 validation 5 Confidence decay model verified, 25/25 tasks completed, Bayesian feedback validated
#03 nan-platform v3 Same domain, Phase 5 validation 5+1 6/6 verification points passed, entities+search enable precise Gate 2/3, hard/soft signals validated, R6 time decay confirmed (type-specific λ + Bayesian deceleration observable at t=1d)

Each experiment includes full evolution logs, Five-Gate decision records, quality audits, and per-round knowledge snapshots — you can diff any two rounds to observe exactly how the Skill's knowledge grew.


Implementation Checklist

1. Define domain boundaries
   Ask: Will knowledge grow through use? Does growth have a ceiling?
   If both yes → proceed.

2. Create directory structure and bootstrap
   skill-name/
   ├── SKILL.md
   ├── scripts/
   │   └── core/              # Computation layer (decay model)
   └── references/
       └── _index.md
   Run `python decay_engine.py init` to scaffold references/, _index.md, and db_config.ini.

3. Write SKILL.md
   - frontmatter: trigger conditions + behavioral guidelines
   - body: tool selection + Five-Gate governance protocol + scaling rules
   - Gate 4 freshness: delegate confidence math to scripts/core/ tools, not inline formulas

4. Initialize references/
   - _index.md: empty routing table
   - 0–2 initial topic files (if domain knowledge is already known)

5. Begin using
   - AI discovers valuable knowledge through real interactions
   - Knowledge passes through Five Gates before being written
   - _index.md updated accordingly

6. Let it converge naturally
   - Knowledge grows selectively; Five Gates prevent bloat
   - Files split or merge when thresholds are reached
   - Mature Skill stabilizes; updates only when the business changes

Roadmap: v2 — Knowledge Lifecycle Governance

Note

v2 is in the design phase. The v1 reference implementation (db-investigator) above remains fully functional.

The pattern is evolving from a database-specific skill to a general-purpose knowledge lifecycle governance mechanism. Core changes:

Two-Phase Knowledge Lifecycle

  • Phase 1 — Knowledge Admission (Distillation): Knowledge must prove its value through repeated use before being accepted. A cumulative distill_score = Σ (Wc × Wr × e^(-λd × days)) tracks credibility-weighted, recency-decayed confirmations. Only when the score crosses a threshold does knowledge graduate to VALIDATED status.
  • Phase 2 — Knowledge Freshness (Preservation): The existing decay model C(t) = C₀ × e^(-λ × (β+1)/(α+1) × t) with asymmetric feedback — success adds cautious credit (+0.3), failure adds aggressive penalty (+1.5). Five correct uses are needed to offset one failure.

Vector Database Storage

  • Moving from Markdown files + routing table to LanceDB (vector database) + BAAI/bge-small-zh-v1.5 (Chinese embedding model)
  • Enables semantic search, vector-based deduplication, and CANDIDATE/VALIDATED partitioning

Domain-Agnostic Governance

  • The governance layer (Five Gates, decay model, vector storage) is decoupled from any specific domain
  • db-investigator becomes the first domain plugin; other domains can reuse the same governance infrastructure

Always-On Activation

  • The skill triggers on every interaction — simultaneously retrieving relevant knowledge and observing whether new knowledge emerges
  • Knowledge capture is natural and continuous, not a separate step

Design documents: research/design/knowledge-lifecycle-v2.md | research/design/implementation-plan-v2.md


References

  • Gao, H., Geng, J., et al. (2026). "A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence." Transactions on Machine Learning Research. arXiv:2507.21046v4. (arXiv)
  • This pattern primarily relates to Section 3.2 (Context Evolution: Memory + Prompt) and Section 4.2 (Inter-test-time Evolution) of the above survey.
  • The Five-Gate protocol aligns with the memory management operations (ADD / MERGE / UPDATE / DELETE) described in Mem0, with a more systematic governance structure layered on top.
  • The memory layer model is inspired by MUSE's hierarchical memory architecture (strategic / procedural / tool-use), adapted for the Skill context as a four-layer structure.

Contributing

This pattern is derived from real-world use in telecom expense management (database investigation and billing audit scenarios). Feedback, case studies, and adaptations to other domains are welcome via Issues and PRs.

If you apply this pattern to a new domain, consider sharing:

  • What domain you applied it to
  • How long it took to reach the "Mature" stage
  • Which of the Five Gates fired most frequently in your case

License

This project is licensed under CC BY-SA 4.0.

The Self-Evolving Skill design pattern — including the Five-Gate Governance Protocol, three-level progressive loading architecture, selective injection mechanism, memory layer model, and maturity stage framework — is an original creation by the author. The pattern was independently conceived and subsequently positioned against the self-evolving agent survey by Gao et al. (2026). That survey addresses self-evolution at the agent and model parameter level; this pattern's distinct contribution is applying self-evolution principles to the Skill layer (prompt injection layer), operating entirely within the context window without modifying model weights or system architecture.

About

A design pattern for Claude Code Skills that improve through use — more accurate, more efficient, never bloated. | 越用越准、越用越快、但不臃肿的 Skill 设计模式

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages