# RuleChef Tutorial

RuleChef turns labeled examples into fast, deterministic rules. The key idea: **an LLM writes the rules at training time, but the rules run locally at inference time** — no API calls, sub-millisecond latency, zero cost per query.

## How it works

```
Your examples ──> RuleChef + LLM ──> Learned rules (regex/code) ──> Fast local execution (<1ms)
                   (training)           (saved to disk)               (no LLM needed)
```

The pipeline:

1. **Buffer** — `add_example()` collects examples in a buffer (not used immediately)
2. **Synthesis** — `learn_rules()` sends examples to an LLM, which writes regex/code patterns
3. **Evaluation** — rules are tested against examples, failures drive refinement
4. **Extraction** — `extract()` runs the learned rules locally, no LLM call

The LLM is only used during `learn_rules()`. Everything else — adding examples, extracting, evaluating — runs locally.

For the full architecture, see [How It Works](https://krlabsorg.github.io/rulechef/getting-started/concepts/) in the docs.
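
To make the extraction step concrete, here is a minimal sketch of running a regex rule locally. `SketchRule` and `run_rules` are hypothetical illustrations, not RuleChef's internal rule representation, and the two patterns are hand-written rather than learned. The point is only that extraction reduces to plain `re` matching.

```python
import re
from dataclasses import dataclass


@dataclass
class SketchRule:
    """Hypothetical stand-in for a learned rule: a regex plus the entity type it emits."""

    pattern: str
    entity_type: str


def run_rules(rules, text):
    """Apply every rule to the text and return entity spans sorted by start offset."""
    entities = []
    for rule in rules:
        for m in re.finditer(rule.pattern, text, flags=re.IGNORECASE):
            entities.append(
                {"text": m.group(0), "start": m.start(), "end": m.end(), "type": rule.entity_type}
            )
    return sorted(entities, key=lambda e: e["start"])


# Hand-written example patterns (assumptions, not output from RuleChef)
sketch_rules = [
    SketchRule(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|units)\b", "DOSAGE"),
    SketchRule(r"\b(?:once|twice)\s+daily\b", "FREQUENCY"),
]
sketch_entities = run_rules(sketch_rules, "Metformin 500mg twice daily for type 2 diabetes.")
for e in sketch_entities:
    print(e)
```

No network call happens anywhere in this path, which is why per-query latency and cost stay low once rules are learned.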

## Part 1: Getting Started

We'll set up an LLM client, define a task (medical NER), and create a RuleChef instance. The **Task** describes what we're extracting — entity types, schemas — and is used to build prompts for the LLM during rule synthesis.

In [None]:
# Install (run once)
# !pip install rulechef datasets

In [None]:
import os

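# Set your API key. Any OpenAI-compatible provider works (OpenAI, Groq, Together, etc.),
# but the client below points at Groq, so use a key for whichever base_url you keep.
# setdefault leaves an already-exported key untouched.
os.environ.setdefault("OPENAI_API_KEY", "")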

In [None]:
import tempfile

from openai import OpenAI

from rulechef import RuleChef, RuleFormat, Task, TaskType

# --- LLM client ---
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)
MODEL = "moonshotai/kimi-k2-instruct-0905"

# --- Task definition ---
task = Task(
    name="Medical Prescription NER",
    description=(
        "Extract DRUG, DOSAGE, FREQUENCY, and CONDITION entities from clinical "
        "prescription text. DRUG = medication name, DOSAGE = amount/strength, "
        "FREQUENCY = how often taken, CONDITION = medical condition being treated."
    ),
    input_schema={"text": "str"},
    output_schema={
        "entities": "List[{text: str, start: int, end: int, type: DRUG|DOSAGE|FREQUENCY|CONDITION}]"
    },
    type=TaskType.NER,
    text_field="text",
)

# --- Create RuleChef ---
storage = tempfile.mkdtemp(prefix="rulechef_tutorial_")
chef = RuleChef(
    task,
    client,
    dataset_name="medical_ner_tutorial",
    storage_path=storage,
    model=MODEL,
    allowed_formats=[RuleFormat.REGEX],  # regex-only for transparency
    use_grex=True,  # use grex to suggest regex patterns from examples
)
print(f"Storage: {storage}")

In [None]:
# --- Display helpers (rich) ---
# Reusable functions for inspecting rules, results, and evaluation during demos.
# pip install rich  (or: pip install rulechef[notebooks])

from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax
from rich.table import Table
from rich.text import Text

console = Console()

# Entity type colors for NER highlighting
_ENTITY_COLORS = {
    "DRUG": "bold cyan",
    "DOSAGE": "bold yellow",
    "FREQUENCY": "bold green",
    "CONDITION": "bold magenta",
}
_FALLBACK_COLORS = [
    "bold cyan",
    "bold yellow",
    "bold green",
    "bold magenta",
    "bold red",
    "bold blue",
]


_seen_colors = {}  # module-level cache so unlisted types keep the same color across calls


def _color_for(entity_type, seen=None):
    """Get a consistent color for an entity type."""
    if seen is None:
        seen = _seen_colors
    if entity_type in _ENTITY_COLORS:
        return _ENTITY_COLORS[entity_type]
    if entity_type not in seen:
        seen[entity_type] = _FALLBACK_COLORS[len(seen) % len(_FALLBACK_COLORS)]
    return seen[entity_type]


def _f1_color(val):
    if val >= 0.8:
        return "green"
    if val >= 0.5:
        return "yellow"
    return "red"


def show_rules(chef):
    """Show all rules in a compact table."""
    table = Table(title="Rules", show_lines=False)
    table.add_column("#", style="dim", width=3)
    table.add_column("Name", max_width=35)
    table.add_column("Format", width=6)
    table.add_column("Pattern", max_width=55)
    table.add_column("Type", width=12)
    table.add_column("Pri", width=3, justify="right")
    table.add_column("Conf", width=5, justify="right")

    for i, rule in enumerate(chef.dataset.rules):
        entity_type = ""
        if rule.output_template:
            entity_type = rule.output_template.get("type", rule.output_template.get("label", ""))
        pattern = rule.content[:55] + "..." if len(rule.content) > 55 else rule.content
        conf = rule.confidence
        conf_style = "green" if conf >= 0.7 else "yellow" if conf >= 0.4 else "red"

        table.add_row(
            str(i),
            rule.name,
            rule.format.value,
            pattern,
            entity_type,
            str(rule.priority),
            f"[{conf_style}]{conf:.2f}[/]",
        )
    console.print(table)


def show_rule(chef, index):
    """Deep dive into a single rule, given its list index (int) or rule id (str)."""
    rules = chef.dataset.rules
    if isinstance(index, str):
        rule = next((r for r in rules if r.id == index), None)
        if not rule:
            console.print(f"[red]Rule '{index}' not found[/]")
            return
    else:
        if index >= len(rules):
            console.print(f"[red]Index {index} out of range (have {len(rules)} rules)[/]")
            return
        rule = rules[index]

    # Header info
    entity_type = ""
    if rule.output_template:
        entity_type = rule.output_template.get("type", rule.output_template.get("label", ""))
    header = f"[bold]{rule.name}[/]  |  {rule.format.value}  |  priority={rule.priority}  |  conf={rule.confidence:.2f}"
    if entity_type:
        header += f"  |  type={entity_type}"

    # Pattern with syntax highlighting
    if rule.format.value == "regex":
        syntax = Syntax(rule.content, "perl", theme="monokai", word_wrap=True)
    elif rule.format.value == "code":
        syntax = Syntax(rule.content, "python", theme="monokai", word_wrap=True)
    else:
        syntax = Syntax(rule.content, "json", theme="monokai", word_wrap=True)

    # Stats
    stats = f"Applied: {rule.times_applied}  |  Successes: {rule.successes}  |  Failures: {rule.failures}"
    if rule.times_applied > 0:
        rate = rule.successes / rule.times_applied * 100
        stats += f"  |  Success rate: {rate:.0f}%"

    # Build the panel: highlighted pattern as the body, stats as the subtitle
    panel = Panel.fit(
        syntax,
        title=header,
        subtitle=stats,
        border_style="blue",
    )
    console.print(panel)

    # Output template
    if rule.output_template:
        console.print(f"  Output template: {rule.output_template}")


def show_eval(eval_result):
    """Show evaluation results with color-coded per-class metrics."""
    if not eval_result or eval_result.total_docs == 0:
        console.print("[dim]No evaluation results.[/]")
        return

    # Summary
    f1_style = _f1_color(eval_result.micro_f1)
    summary = Text()
    summary.append("Micro F1: ")
    summary.append(f"{eval_result.micro_f1:.1%}", style=f"bold {f1_style}")
    summary.append(f"  |  Macro F1: {eval_result.macro_f1:.1%}")
    summary.append(f"  |  Exact match: {eval_result.exact_match:.1%}")
    summary.append(f"  |  P={eval_result.micro_precision:.1%}  R={eval_result.micro_recall:.1%}")
    summary.append(f"  |  {eval_result.total_docs} docs")
    console.print(summary)

    # Per-class table
    if eval_result.per_class:
        table = Table(show_lines=False)
        table.add_column("Class", min_width=15)
        table.add_column("F1", justify="right", width=6)
        table.add_column("Prec", justify="right", width=6)
        table.add_column("Recall", justify="right", width=6)
        table.add_column("TP", justify="right", width=4, style="green")
        table.add_column("FP", justify="right", width=4, style="red")
        table.add_column("FN", justify="right", width=4, style="yellow")

        for cm in sorted(eval_result.per_class, key=lambda c: c.f1, reverse=True):
            f1_style = _f1_color(cm.f1)
            table.add_row(
                cm.label,
                f"[{f1_style}]{cm.f1:.0%}[/]",
                f"{cm.precision:.0%}",
                f"{cm.recall:.0%}",
                str(cm.tp),
                str(cm.fp),
                str(cm.fn),
            )
        console.print(table)


def show_failures(eval_result, entity_type=None):
    """Show what the FPs and FNs actually are.

    Args:
        eval_result: EvalResult from chef.evaluate() or evaluate_dataset()
        entity_type: Optional filter — only show failures involving this type (e.g. "FREQUENCY")
    """
    if not eval_result.failures:
        console.print("[green]No failures — all examples matched![/]")
        return

    shown = 0
    for failure in eval_result.failures:
        inp = failure["input"]
        expected = failure["expected"]
        got = failure["got"]
        text = inp.get("text", str(inp))

        # Get entity lists (NER/extraction) or labels (classification)
        expected_entities = expected.get("entities", expected.get("spans", []))
        got_entities = got.get("entities", got.get("spans", []))

        # Classification: simple expected vs got label
        if not expected_entities and not got_entities:
            exp_label = expected.get("label", "")
            got_label = got.get("label", "")
            if entity_type and entity_type not in (exp_label, got_label):
                continue
            shown += 1
            t = Text()
            t.append(f"{shown}. ", style="dim")
            t.append(text)
            t.append("\n   expected: ", style="dim")
            t.append(exp_label, style="green")
            t.append("   got: ", style="dim")
            t.append(got_label or "(no match)", style="red")
            console.print(t)
            continue

        # NER: diff expected vs got to find FPs and FNs
        # Use list-based matching (not sets) so duplicates are caught correctly
        def _key(e):
            return (e.get("text", "").lower(), e.get("type", e.get("label", "")))

        remaining_expected = [_key(e) for e in expected_entities]
        fps = []
        for e in got_entities:
            k = _key(e)
            if k in remaining_expected:
                remaining_expected.remove(k)
            else:
                fps.append(e)

        remaining_got = [_key(e) for e in got_entities]
        fns = []
        for e in expected_entities:
            k = _key(e)
            if k in remaining_got:
                remaining_got.remove(k)
            else:
                fns.append(e)

        # Apply entity_type filter
        if entity_type:
            fps = [e for e in fps if e.get("type", e.get("label", "")) == entity_type]
            fns = [e for e in fns if e.get("type", e.get("label", "")) == entity_type]
            if not fps and not fns:
                continue

        shown += 1
        t = Text()
        t.append(f"{shown}. ", style="dim")
        t.append(text, style="white")
        console.print(t)

        # Keys of all expected entities. An FP whose key appears here was expected
        # once but predicted more than once, so it duplicates an already-matched TP.
        expected_keys = [_key(e) for e in expected_entities]

        for e in fps:
            etype = e.get("type", e.get("label", "?"))
            k = _key(e)
            is_dup = k in expected_keys  # same text+type already matched as TP
            tag = "FP (duplicate)" if is_dup else "FP"
            console.print(f"   [red]{tag}[/]  [{etype}: {e.get('text', '?')!r}]")
        for e in fns:
            etype = e.get("type", e.get("label", "?"))
            console.print(f"   [yellow]FN[/]  [{etype}: {e.get('text', '?')!r}]")

    if shown == 0:
        if entity_type:
            console.print(f"[dim]No failures involving {entity_type}.[/]")
        else:
            console.print("[dim]No failures to show.[/]")
    else:
        console.print(f"\n[dim]{shown} documents with errors[/]")


def test(chef, text):
    """Extract from text and display results with highlighted entities."""
    result = chef.extract({"text": text})

    entities = result.get("entities", result.get("spans", []))
    label = result.get("label")

    if entities:
        # Build annotated text with colored entity spans
        annotated = Text()
        prev_end = 0
        for e in sorted(entities, key=lambda x: x.get("start", 0)):
            start = e.get("start", 0)
            end = e.get("end", 0)
            etype = e.get("type", "?")
            color = _color_for(etype)

            annotated.append(text[prev_end:start])
            annotated.append(f"[{etype}: ", style="dim")
            annotated.append(e.get("text", text[start:end]), style=color)
            annotated.append("]", style="dim")
            prev_end = end
        annotated.append(text[prev_end:])
        console.print(annotated)
    elif label:
        console.print(Text.assemble(text, "  →  ", (label, "bold cyan")))
    else:
        console.print(Text.assemble(text, "  →  ", ("(no match)", "dim")))


print("Display helpers loaded: show_rules, show_rule, show_eval, show_failures, test")

### Add examples

Examples go into a **buffer** first — they're not used until you call `learn_rules()`. This lets you collect a batch of examples before learning, and gives the coordinator a chance to decide the best learning strategy.

We start with just 3 examples, each showing a clinical prescription and its entity spans.

In [None]:
examples = [
    (
        "Metformin 500mg twice daily for type 2 diabetes.",
        [
            {"text": "Metformin", "start": 0, "end": 9, "type": "DRUG"},
            {"text": "500mg", "start": 10, "end": 15, "type": "DOSAGE"},
            {"text": "twice daily", "start": 16, "end": 27, "type": "FREQUENCY"},
            {"text": "type 2 diabetes", "start": 32, "end": 47, "type": "CONDITION"},
        ],
    ),
    (
        "Take lisinopril 10mg once daily for high blood pressure.",
        [
            {"text": "lisinopril", "start": 5, "end": 15, "type": "DRUG"},
            {"text": "10mg", "start": 16, "end": 20, "type": "DOSAGE"},
            {"text": "once daily", "start": 21, "end": 31, "type": "FREQUENCY"},
            {"text": "high blood pressure", "start": 36, "end": 55, "type": "CONDITION"},
        ],
    ),
    (
        "Ibuprofen 400mg every 6 hours as needed for pain.",
        [
            {"text": "Ibuprofen", "start": 0, "end": 9, "type": "DRUG"},
            {"text": "400mg", "start": 10, "end": 15, "type": "DOSAGE"},
            {"text": "every 6 hours", "start": 16, "end": 29, "type": "FREQUENCY"},
            {"text": "pain", "start": 44, "end": 48, "type": "CONDITION"},
        ],
    ),
]

for text, entities in examples:
    chef.add_example({"text": text}, {"entities": entities})

# Examples sit in a buffer until we call learn_rules()
print(chef.get_buffer_stats())
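
Character offsets are easy to get wrong when labeling by hand. A quick sanity check that `text[start:end]` equals each entity's `text` can catch off-by-one errors before they reach the LLM. `check_spans` below is a hypothetical helper written for this tutorial, not part of RuleChef.

```python
def check_spans(text, entities):
    """Return a list of human-readable problems; empty if all offsets are consistent."""
    problems = []
    for e in entities:
        actual = text[e["start"]:e["end"]]
        if actual != e["text"]:
            problems.append(
                f"{e['type']}: expected {e['text']!r} at [{e['start']}:{e['end']}], found {actual!r}"
            )
    return problems


sample_text = "Metformin 500mg twice daily for type 2 diabetes."
good_spans = [{"text": "500mg", "start": 10, "end": 15, "type": "DOSAGE"}]
bad_spans = [{"text": "500mg", "start": 9, "end": 14, "type": "DOSAGE"}]  # deliberately shifted by one

print(check_spans(sample_text, good_spans))
print(check_spans(sample_text, bad_spans))
```

In the notebook, the same check can be looped over `examples` before calling `add_example()`.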

### Learn rules

`learn_rules()` is where the LLM comes in. It:
1. **Commits** buffered examples to the dataset
2. **Sends** examples to the LLM with a synthesis prompt
3. **Gets back** regex rules that match the patterns in the examples
4. **Tests** the rules against the examples and patches failures

After this call, the rules are stored locally — no more LLM calls needed for extraction.

In [None]:
rules, eval_result = chef.learn_rules()

## Part 2: Inspect Everything

### Rule summary

A quick overview of what was learned:

In [None]:
show_rules(chef)

### Inspect a rule

`show_rule(chef, index)` shows the full pattern with syntax highlighting, output template, and stats. Try different indices to explore what was learned:

In [None]:
# Deep dive into the first rule — see the full regex, output template, stats
show_rule(chef, 0)

### Try them out

Let's test on texts the rules have **never seen**. We'll include several unseen FREQUENCY patterns — `"at bedtime"`, `"each morning"`, `"three times a day"` — to see how well the rules generalize beyond the training data.

Take note of any gaps — we'll fix them in Part 3.

In [None]:
test_texts = [
    "Metformin 1000mg once daily for insulin resistance.",  # should work
    "Take aspirin 81mg daily for cardiac prevention.",  # "Take" prefix
    "Gabapentin 300mg at bedtime for nerve pain.",  # "at bedtime" — unseen
    "Levothyroxine 100mcg each morning for hypothyroidism.",  # "each morning" — unseen
    "Amoxicillin 500mg three times a day for strep throat.",  # "three times a day" — unseen
    "The patient reported feeling dizzy after the procedure.",  # no entities
]

for text in test_texts:
    test(chef, text)

In [None]:
# How fast is it? No LLM call — just regex matching.
%timeit chef.extract({"text": "Metformin 500mg twice daily for type 2 diabetes."})
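
`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. In a plain script, the standard-library `timeit` module gives a comparable measurement. The sketch below times a stand-in regex rather than `chef.extract`, so it runs without RuleChef; the pattern is illustrative only.

```python
import re
import timeit

# Stand-in for a learned rule (an assumption, not RuleChef output), precompiled once.
dosage_re = re.compile(r"\b\d+\s?(?:mg|mcg)\b")
sample = "Metformin 500mg twice daily for type 2 diabetes."

runs = 10_000
total = timeit.timeit(lambda: dosage_re.findall(sample), number=runs)
per_call_us = total / runs * 1e6
print(f"{per_call_us:.2f} us per call over {runs} runs")
```

To time the real extractor, swap the lambda for one that calls `chef.extract({"text": sample})`.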

### Training accuracy

Quick check: how well do the rules do on the examples they were trained on? Even 100% here doesn't mean the rules generalize — the `test()` calls above are a better signal.

In [None]:
show_eval(chef.evaluate(verbose=False))
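
`show_eval` prints both micro and macro F1. As a reminder of how the two differ, the sketch below computes them from per-class TP/FP/FN counts using the standard definitions. The counts are made up for illustration, and RuleChef's own evaluator may differ in details such as how spans are matched.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, using 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical per-class counts: (TP, FP, FN)
counts = {"DRUG": (9, 1, 0), "DOSAGE": (10, 0, 0), "FREQUENCY": (3, 0, 3)}

# Micro: pool the counts across classes, then compute one score.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

A weak class with few examples (FREQUENCY here) pulls macro F1 down more than micro F1, which is why the per-class table is worth reading alongside the summary line.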

### Per-rule metrics

Which rules are pulling their weight? `get_rule_metrics()` evaluates each rule in isolation — useful for finding dead rules (0 matches) or harmful ones (low precision).

In [None]:
rule_metrics = chef.get_rule_metrics()

### Delete a rule

If a rule is dead (0 matches) or harmful, you can remove it:

In [None]:
# Find a dead rule (0 matches) if any
dead = [m for m in rule_metrics if m.matches == 0]
if dead:
    print(f"Deleting dead rule: {dead[0].rule_name} (id={dead[0].rule_id})")
    chef.delete_rule(dead[0].rule_id)
    print(f"Rules remaining: {len(chef.dataset.rules)}")
else:
    print("No dead rules found — all rules match something.")

## Part 3: Improve the Rules

The rules got high accuracy on training data — but the unseen tests in Part 2 likely showed some gaps. Two tools to fix that: **corrections** (fix a specific wrong answer) and **feedback** (general domain guidance).

### Corrections

Corrections are the **highest-value training signal** — they pair what the rules produced with what the output *should* be. They're always included first in the LLM prompt.

The test above likely showed a gap on `"Gabapentin 300mg at bedtime for nerve pain."` — `"at bedtime"` is a FREQUENCY pattern the rules never saw in training. Let's correct it:

In [None]:
# Correct the "at bedtime" test case from above
test_input = {"text": "Gabapentin 300mg at bedtime for nerve pain."}
test(chef, test_input["text"])

# Add a correction — tell RuleChef what the right answer should be
model_output = chef.extract(test_input)
chef.add_correction(
    test_input,
    model_output=model_output,
    expected_output={
        "entities": [
            {"text": "Gabapentin", "start": 0, "end": 10, "type": "DRUG"},
            {"text": "300mg", "start": 11, "end": 16, "type": "DOSAGE"},
            {"text": "at bedtime", "start": 17, "end": 27, "type": "FREQUENCY"},
            {"text": "nerve pain", "start": 32, "end": 42, "type": "CONDITION"},
        ]
    },
    feedback="FREQUENCY 'at bedtime' is a valid frequency pattern, "
    "similar to 'once daily' or 'every 6 hours'.",
)

In [None]:
# Incremental learn — patch existing rules instead of starting from scratch
rules, eval_result = chef.learn_rules(incremental_only=True)

In [None]:
# Test all unseen texts again — did the correction help?
for text in test_texts:
    test(chef, text)

### Feedback

The correction fixed one specific failure. But there are likely more — `"each morning"` and `"three times a day"` are also FREQUENCY patterns the rules may have missed.

**Feedback** is broader: natural-language guidance that gets injected into the LLM's synthesis prompt, steering *all future* rule generation. A single piece of feedback can fix an entire class of errors:

In [None]:
# Domain knowledge: help the LLM write more general FREQUENCY patterns.
# 'as needed' is left out of the list because the Ibuprofen training example does not label it.
chef.add_feedback(
    "FREQUENCY entities include patterns like 'at bedtime', 'each morning', "
    "'three times a day', 'every N hours', not just 'once/twice daily'. "
    "Always match the FULL frequency phrase: 'once daily' is one FREQUENCY entity, "
    "don't also match 'daily' alone as a separate entity. Cover these common patterns explicitly in the rules."
)

# Re-learn — feedback gets injected into the synthesis prompt
rules, eval_result = chef.learn_rules()

In [None]:
# Test the same unseen texts from Part 2 — did feedback help generalization?
for text in test_texts:
    test(chef, text)

### Sampling strategies

When you have lots of examples, RuleChef can't fit them all in the LLM prompt. It picks a subset using a **sampling strategy**:

| Strategy | What it does |
|----------|-------------|
| `balanced` | Takes examples in order (default) |
| `recent` | Newest examples first |
| `diversity` | Evenly spaces picks across the dataset |
| `uncertain` | Low-confidence examples first |
| `varied` | Mix of recent + diverse + uncertain |
| `corrections_first` | Corrections always go first (they already do by default, but this also sorts examples by recency) |

All strategies always include corrections first — they're the highest-value signal.

You can set a default strategy when creating RuleChef, or override per call:

In [None]:
# Override sampling for a single learn call
chef.learn_rules(sampling_strategy="diversity")
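
To illustrate the idea behind the table, here is a minimal sketch of two strategies: `recent` (newest first) and `diversity` (evenly spaced picks), with corrections always placed first. The function `sample_for_prompt` and its arguments are hypothetical; RuleChef's actual sampling code may work differently.

```python
def sample_for_prompt(corrections, examples, budget, strategy="diversity"):
    """Pick up to `budget` items for the prompt, corrections first.

    `examples` is assumed to be ordered oldest to newest.
    """
    picked = list(corrections[:budget])
    remaining = budget - len(picked)
    if remaining <= 0 or not examples:
        return picked

    if strategy == "recent":
        picked += list(reversed(examples))[:remaining]
    elif strategy == "diversity":
        n = len(examples)
        if remaining >= n:
            picked += list(examples)
        else:
            # Evenly spaced indices across the whole list.
            step = n / remaining
            picked += [examples[int(i * step)] for i in range(remaining)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return picked


pool = [f"ex{i}" for i in range(10)]
diverse_pick = sample_for_prompt(["corr0"], pool, budget=4, strategy="diversity")
recent_pick = sample_for_prompt(["corr0"], pool, budget=4, strategy="recent")
print(diverse_pick)
print(recent_pick)
```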

## Part 4: Under the Hood

What actually goes to the LLM? Let's look at the prompts RuleChef builds.

### The synthesis prompt

This is the prompt that asks the LLM to generate rules from examples. It includes:
1. **Task description** — what we're trying to do
2. **Data evidence** — patterns found in the examples (including auto-generated regex hints from `grex`)
3. **Training examples** — the actual input/output pairs
4. **User feedback** — any guidance you've added
5. **Format instructions** — how to write regex rules, with technique guides
6. **Response schema** — the exact JSON format to return

In [None]:
# Build the synthesis prompt (what the LLM sees)
prompt = chef.learner._build_synthesis_prompt(chef.dataset, max_rules=10)
print(f"Prompt length: {len(prompt)} chars")
print("=" * 80)
print(prompt[:3000])
print("\n... (truncated) ...")

### grex: regex hints from examples

[grex](https://github.com/pemistahl/grex) is a library that infers regex patterns from example strings. You give it strings, it gives you a regex that matches all of them. RuleChef uses this to give the LLM concrete pattern suggestions during rule synthesis.

Let's first see what grex does on its own, then see how RuleChef uses it in prompts.

In [None]:
import re

try:
    from grex import RegExpBuilder

    # Example 1: Dates — grex finds the structural pattern
    dates = ["2024-01-15", "2024-02-28", "2023-12-01", "2025-06-30"]
    exact = RegExpBuilder.from_test_cases(dates).without_anchors().build()
    generalized = (
        RegExpBuilder.from_test_cases(dates)
        .without_anchors()
        .with_conversion_of_digits()
        .with_conversion_of_repetitions()
        .build()
    )
    print("=== Dates ===")
    print(f"  Input:      {dates}")
    print(f"  Exact:      {exact}")
    print(f"  Generalized: {generalized}")
    # Verify the generalized pattern matches new dates
    print(f"  Matches '2026-03-14'? {bool(re.search(generalized, '2026-03-14'))}")
    print()

    # Example 2: Dollar amounts
    amounts = ["$10.99", "$5.00", "$249.95", "$1.50"]
    exact = RegExpBuilder.from_test_cases(amounts).without_anchors().build()
    generalized = (
        RegExpBuilder.from_test_cases(amounts)
        .without_anchors()
        .with_conversion_of_digits()
        .with_conversion_of_repetitions()
        .build()
    )
    print("=== Dollar amounts ===")
    print(f"  Input:      {amounts}")
    print(f"  Exact:      {exact}")
    print(f"  Generalized: {generalized}")
    print()

    # Example 3: Natural language — exact pattern is just alternation
    phrases = ["exchange rate", "current rates", "conversion rate"]
    exact = RegExpBuilder.from_test_cases(phrases).without_anchors().build()
    print("=== Natural language phrases ===")
    print(f"  Input:      {phrases}")
    print(f"  Exact:      {exact}")
    print("  (No structural pattern — text is too diverse to generalize)")

except ImportError:
    print("grex not installed. Install with: pip install rulechef[grex]")

In [None]:
# Extract just the data evidence section from the prompt
prompt_with_grex = chef.learner._build_synthesis_prompt(chef.dataset, max_rules=10)

# Find the data evidence section
start = prompt_with_grex.find("DATA EVIDENCE FROM TRAINING:")
end = prompt_with_grex.find("\n\n", start + 1) if start != -1 else -1
if start != -1:
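    # If no blank line follows the section, slice to the end instead of dropping the last char
    evidence_with = prompt_with_grex[start : end if end != -1 else None].strip()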
    print("=== With grex (default) ===")
    print(evidence_with)
else:
    print("DATA EVIDENCE section not found — grex may not be installed.")
    print("Install with: pip install rulechef[grex]")

In [None]:
# Now disable grex and compare
chef.learner.prompt_builder.use_grex = False
prompt_without_grex = chef.learner._build_synthesis_prompt(chef.dataset, max_rules=10)

start = prompt_without_grex.find("DATA EVIDENCE FROM TRAINING:")
end = prompt_without_grex.find("\n\n", start + 1) if start != -1 else -1
if start != -1:
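    # If no blank line follows the section, slice to the end instead of dropping the last char
    evidence_without = prompt_without_grex[start : end if end != -1 else None].strip()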
    print("=== Without grex ===")
    print(evidence_without)
else:
    print("No DATA EVIDENCE section — this task type may not generate one.")

# Restore grex
chef.learner.prompt_builder.use_grex = True

# Show the difference
print("\n" + "=" * 60)
print(f"Prompt length WITH grex:    {len(prompt_with_grex):,} chars")
print(f"Prompt length WITHOUT grex: {len(prompt_without_grex):,} chars")
print(f"grex adds {len(prompt_with_grex) - len(prompt_without_grex):,} chars of regex hints")

## Part 5: Agentic Learning

So far the learning loop uses simple heuristics: "did F1 improve? keep going." The **Agentic Coordinator** replaces those heuristics with an LLM that reads the per-class metrics after each iteration and writes targeted guidance: *"FREQUENCY recall is 60% — the rules miss 'at bedtime' and 'each morning', add alternation patterns for these."*

It also decides when to **stop early** — no point iterating if everything is at 100%.

To give it a real workout, we'll add 9 more varied examples: different dosage units (`mcg`, `units`), new frequency patterns (`"at bedtime"`, `"each morning"`, `"three times daily"`), filler text (`"before breakfast"`, `"taper over 2 weeks"`), and a multi-word drug (`"Insulin glargine"`).

In [None]:
from rulechef.coordinator import AgenticCoordinator

# Add more examples from the full dataset so the agentic coordinator has more to work with
more_examples = [
    (
        "Aspirin 81mg daily for cardiac prevention.",
        [
            {"text": "Aspirin", "start": 0, "end": 7, "type": "DRUG"},
            {"text": "81mg", "start": 8, "end": 12, "type": "DOSAGE"},
            {"text": "daily", "start": 13, "end": 18, "type": "FREQUENCY"},
            {"text": "cardiac prevention", "start": 23, "end": 41, "type": "CONDITION"},
        ],
    ),
    (
        "Omeprazole 20mg once daily before breakfast for acid reflux.",
        [
            {"text": "Omeprazole", "start": 0, "end": 10, "type": "DRUG"},
            {"text": "20mg", "start": 11, "end": 15, "type": "DOSAGE"},
            {"text": "once daily", "start": 16, "end": 26, "type": "FREQUENCY"},
            {"text": "acid reflux", "start": 48, "end": 59, "type": "CONDITION"},
        ],
    ),
    (
        "Atorvastatin 40mg at bedtime for high cholesterol.",
        [
            {"text": "Atorvastatin", "start": 0, "end": 12, "type": "DRUG"},
            {"text": "40mg", "start": 13, "end": 17, "type": "DOSAGE"},
            {"text": "at bedtime", "start": 18, "end": 28, "type": "FREQUENCY"},
            {"text": "high cholesterol", "start": 33, "end": 49, "type": "CONDITION"},
        ],
    ),
    (
        "Prednisone 60mg once daily for asthma, taper over 2 weeks.",
        [
            {"text": "Prednisone", "start": 0, "end": 10, "type": "DRUG"},
            {"text": "60mg", "start": 11, "end": 15, "type": "DOSAGE"},
            {"text": "once daily", "start": 16, "end": 26, "type": "FREQUENCY"},
            {"text": "asthma", "start": 31, "end": 37, "type": "CONDITION"},
        ],
    ),
    (
        "Warfarin 5mg once daily for deep vein thrombosis.",
        [
            {"text": "Warfarin", "start": 0, "end": 8, "type": "DRUG"},
            {"text": "5mg", "start": 9, "end": 12, "type": "DOSAGE"},
            {"text": "once daily", "start": 13, "end": 23, "type": "FREQUENCY"},
            {"text": "deep vein thrombosis", "start": 28, "end": 48, "type": "CONDITION"},
        ],
    ),
    (
        "Sertraline 50mg once daily for generalized anxiety disorder.",
        [
            {"text": "Sertraline", "start": 0, "end": 10, "type": "DRUG"},
            {"text": "50mg", "start": 11, "end": 15, "type": "DOSAGE"},
            {"text": "once daily", "start": 16, "end": 26, "type": "FREQUENCY"},
            {"text": "generalized anxiety disorder", "start": 31, "end": 59, "type": "CONDITION"},
        ],
    ),
    (
        "Levothyroxine 100mcg each morning for hypothyroidism.",
        [
            {"text": "Levothyroxine", "start": 0, "end": 13, "type": "DRUG"},
            {"text": "100mcg", "start": 14, "end": 20, "type": "DOSAGE"},
            {"text": "each morning", "start": 21, "end": 33, "type": "FREQUENCY"},
            {"text": "hypothyroidism", "start": 38, "end": 52, "type": "CONDITION"},
        ],
    ),
    (
        "Insulin glargine 18 units at bedtime for diabetes mellitus.",
        [
            {"text": "Insulin glargine", "start": 0, "end": 16, "type": "DRUG"},
            {"text": "18 units", "start": 17, "end": 25, "type": "DOSAGE"},
            {"text": "at bedtime", "start": 26, "end": 36, "type": "FREQUENCY"},
            {"text": "diabetes mellitus", "start": 41, "end": 58, "type": "CONDITION"},
        ],
    ),
    (
        "Gabapentin 300mg three times daily for neuropathic pain.",
        [
            {"text": "Gabapentin", "start": 0, "end": 10, "type": "DRUG"},
            {"text": "300mg", "start": 11, "end": 16, "type": "DOSAGE"},
            {"text": "three times daily", "start": 17, "end": 34, "type": "FREQUENCY"},
            {"text": "neuropathic pain", "start": 39, "end": 55, "type": "CONDITION"},
        ],
    ),
]

for text, entities in more_examples:
    chef.add_example({"text": text}, {"entities": entities})

print(f"Buffer: {chef.get_buffer_stats()['new_examples']} new examples")

In [None]:
# Switch to agentic coordinator
chef.coordinator = AgenticCoordinator(client, model=MODEL)

# Learn with more refinement iterations — watch the coordinator's guidance
# For NER, it will analyze per-entity-type precision/recall and guide synthesis
rules, eval_result = chef.learn_rules(max_refinement_iterations=5)

In [None]:
# Check the results
show_eval(chef.evaluate())

### Rule pruning

After multiple refinement iterations, the ruleset can get bloated with redundant rules. Enable `prune_after_learn` and the coordinator will merge similar patterns and remove pure noise — with a safety net that reverts if quality drops.

In [None]:
print(f"Rules before pruning: {len(chef.dataset.rules)}")

# Enable pruning
chef.coordinator = AgenticCoordinator(client, model=MODEL, prune_after_learn=True)

rules, eval_result = chef.learn_rules(max_refinement_iterations=3)

print(f"\nRules after pruning: {len(chef.dataset.rules)}")
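
The safety net can be pictured as "prune, re-evaluate, revert on regression". The sketch below shows that pattern with a toy scoring function. `prune_with_safety_net`, `coverage`, and the `tolerance` parameter are all hypothetical and stand in for whatever the coordinator does internally.

```python
def prune_with_safety_net(current, candidate, score, tolerance=0.0):
    """Accept the pruned ruleset only if its score drops by no more than `tolerance`."""
    before = score(current)
    after = score(candidate)
    if after + tolerance >= before:
        return candidate, after
    return current, before  # revert: pruning hurt quality


# Toy setup: each "rule" is just the set of example ids it gets right.
toy_rules = [{1, 2}, {2, 3}, {3}, {4}]
target = {1, 2, 3, 4}


def coverage(ruleset):
    """Fraction of target examples covered by at least one rule."""
    covered = set().union(*ruleset) if ruleset else set()
    return len(covered & target) / len(target)


safe_prune = [{1, 2}, {2, 3}, {4}]  # drops the redundant {3}
bad_prune = [{1, 2}, {2, 3}]  # drops {4}, losing coverage

kept_safe, score_safe = prune_with_safety_net(toy_rules, safe_prune, coverage)
kept_bad, score_bad = prune_with_safety_net(toy_rules, bad_prune, coverage)
print(len(kept_safe), score_safe)
print(len(kept_bad), score_bad)
```

The first prune is accepted because coverage is unchanged; the second is reverted because it loses an example.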

## Part 6: LLM Observability

What if you already have an LLM extracting entities in production? Instead of defining a task and manually labeling examples, you can **wrap your existing OpenAI client** and RuleChef will:

1. **Observe** every LLM call — capturing messages and responses
2. **Auto-discover** the task schema from the patterns it sees
3. **Map** observations to the discovered schema
4. **Learn rules** that replicate the LLM's behavior

```
Your code → wrapped_client.chat.completions.create(...) → LLM responds
                         ↓ (observed)
                   RuleChef captures raw interactions
                         ↓ (at learn time)
                   Discover task → Map to schema → Learn rules
```

**Zero-config**: no Task definition needed, no custom extractors. Just wrap and go.
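
The wrapping idea can be illustrated without any network calls. This is a generic sketch of call interception, not RuleChef's actual `start_observing()` implementation. `observe`, `fake_create`, and `captured_calls` are hypothetical names, and `fake_create` merely stands in for `client.chat.completions.create`.

```python
import functools


def observe(fn, log):
    """Wrap fn so each call's messages and return value are appended to log."""

    @functools.wraps(fn)
    def wrapper(**kwargs):
        reply = fn(**kwargs)
        log.append({"messages": kwargs.get("messages"), "response": reply})
        return reply

    return wrapper


def fake_create(model, messages, temperature=0):
    """Stand-in for an LLM call; always returns the same canned JSON string."""
    return '{"entities": []}'


captured_calls = []
observed_create = observe(fake_create, captured_calls)

# The caller's code does not change: same arguments, same return value.
reply = observed_create(
    model="demo-model",
    messages=[{"role": "user", "content": "Aspirin 81mg daily."}],
)
print(f"Captured {len(captured_calls)} call(s)")
```

The caller sees the same return value as before, while the log accumulates the raw material that can later be mapped to a schema.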

In [None]:
# Zero-config: observe an LLM extracting entities from prescriptions
storage_obs = tempfile.mkdtemp(prefix="rulechef_observe_")
observer_chef = RuleChef(
    client=client,  # No task!
    dataset_name="medical_observed",
    storage_path=storage_obs,
    model=MODEL,
    allowed_formats=[RuleFormat.REGEX],
    use_grex=True,
)

# Wrap the client — returns the same client object, now observed
wrapped_client = observer_chef.start_observing(client, auto_learn=False)
print("Observing... make some LLM calls.")

In [None]:
import json

# Simulate your existing LLM pipeline — extract entities from prescriptions
# RuleChef silently captures every call (messages + response).
prescriptions = [
    "Metformin 500mg twice daily for type 2 diabetes.",
    "Take lisinopril 10mg once daily for high blood pressure.",
    "Ibuprofen 400mg every 6 hours as needed for pain.",
    "Aspirin 81mg daily for cardiac prevention.",
    "Omeprazole 20mg once daily before breakfast for acid reflux.",
    "Atorvastatin 40mg at bedtime for high cholesterol.",
    "Prednisone 60mg once daily for asthma, taper over 2 weeks.",
    "Warfarin 5mg once daily for deep vein thrombosis.",
]

system_prompt = (
    "Extract medical entities from the prescription text. "
    "Return JSON with an 'entities' list. Each entity has: "
    "'text' (exact span), 'start' (char offset), 'end' (char offset), "
    "and 'type' (one of: DRUG, DOSAGE, FREQUENCY, CONDITION). "
    "Return only valid JSON, no explanation."
)

for prescription in prescriptions:
    response = wrapped_client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prescription},
        ],
        temperature=0,
    )
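    # message.content can be None (e.g. on a refusal), so fall back to an empty string
    result = (response.choices[0].message.content or "").strip()
    try:
        entities = json.loads(result).get("entities", [])
        types = ", ".join(e.get("type", "?") for e in entities)
        print(f"{prescription[:55]:57s} → [{types}]")
    except (ValueError, AttributeError, TypeError):
        # Not valid JSON, or JSON that is not an object holding a list of entity dicts
        print(f"{prescription[:55]:57s} → (parse error)")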

# Check what was captured
print(f"\nObserver stats: {observer_chef.get_buffer_stats().get('observer', {})}")
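
Models do not always return bare JSON even when asked to. Replies sometimes arrive wrapped in a code fence or preceded by a sentence of preamble. A tolerant parser along these lines can make the display loop above more robust. It is a sketch for your own pipeline code (`parse_entities_reply` is a hypothetical helper) and is independent of RuleChef, which captures the raw interaction either way.

```python
import json


def parse_entities_reply(raw_reply):
    """Extract the 'entities' list from a model reply, tolerating text around the JSON."""
    if not raw_reply:
        return []
    # Keep only the outermost {...} so fences or preamble around it are ignored.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end < start:
        return []
    try:
        payload = json.loads(raw_reply[start : end + 1])
    except json.JSONDecodeError:
        return []
    found = payload.get("entities", []) if isinstance(payload, dict) else []
    return found if isinstance(found, list) else []


noisy_reply = 'Here is the result:\n{"entities": [{"text": "Aspirin", "type": "DRUG"}]}\nDone.'
print(parse_entities_reply(noisy_reply))
print(parse_entities_reply("sorry, I cannot help with that"))
```

Returning an empty list on every failure path keeps the calling loop free of exception handling, at the cost of hiding why a reply was rejected.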

In [None]:
# learn_rules() does everything:
#   1. Discovers the task schema from raw observations (since no task was provided)
#   2. Maps each observation to the discovered schema
#   3. Commits mapped examples to buffer → dataset
#   4. Synthesizes + refines rules
rules, eval_result = observer_chef.learn_rules()

# Show what was discovered
print(f"\nDiscovered task: {observer_chef.task.name}")
print(f"  Type: {observer_chef.task.type.value}")
print(f"  Input schema: {observer_chef.task.input_schema}")
print(f"  Output schema: {observer_chef.task.output_schema}")

# Stop observing — restores the original client
observer_chef.stop_observing()

# The rules work without any LLM calls
for text in [
    "Warfarin 5mg once daily for deep vein thrombosis.",
    "Tramadol 50mg every 6 hours for chronic back pain.",
]:
    result = observer_chef.extract({"text": text})
    entities = result.get("entities", [])
    print(f"\n{text}")
    for e in entities:
        print(f"  [{e.get('type', '?')}] {e.get('text', '?')!r}")
    if not entities:
        print("  (no entities)")

## Part 7: Scale Up

Everything above used NER on hand-crafted prescriptions. But does RuleChef work on real, messy data — and on a different task type?

[Banking77](https://huggingface.co/datasets/legacy-datasets/banking77) has 13,083 banking customer queries (10,003 train, 3,080 test) across 77 intent classes. We'll pick a subset of classes, feed RuleChef all available training data for those classes, and let the **agentic coordinator** handle the rest: per-class synthesis, targeted refinement, and early stopping.

Key settings to experiment with:
- `CLASSES`: which classes to learn (try more or fewer)
- `max_refinement_iterations`: how many refinement rounds to allow (more rounds can improve the rules but take longer and use more LLM calls)
- `use_grex`: regex hints generated from example strings
- `AgenticCoordinator`: LLM-guided refinement that steers toward weak classes
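
To give a feel for what a regex hint derived from example strings might look like, here is a deliberately naive stand-in. It is not grex's algorithm, which is more sophisticated. `naive_regex_hint` is a hypothetical helper that simply builds an anchored alternation of the escaped examples.

```python
import re


def naive_regex_hint(example_strings):
    """Build an anchored alternation that matches exactly the given strings."""
    # Longest first (ties broken alphabetically) so the output is deterministic.
    unique = sorted(set(example_strings), key=lambda s: (-len(s), s))
    return "^(?:" + "|".join(re.escape(s) for s in unique) + ")$"


dosage_hint = naive_regex_hint(["500mg", "10mg", "81mg", "10mg"])
print(dosage_hint)

dosage_pattern = re.compile(dosage_hint)
print(bool(dosage_pattern.match("10mg")))  # seen in the examples
print(bool(dosage_pattern.match("20mg")))  # unseen, so the hint does not match
```

A hint like this only memorizes its inputs, which is why it serves as a starting point for the LLM to generalize (for instance toward `\d+mg`) and not as a finished rule.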

In [None]:
import uuid
from collections import defaultdict

from datasets import load_dataset

from rulechef.coordinator import AgenticCoordinator
from rulechef.core import Dataset, Example

# --- Configuration ---
CLASSES = [
    "exchange_rate",
    "card_arrival",
    "beneficiary_not_allowed",
    "disposable_card_limits",
    "pending_cash_withdrawal",
]
MAX_REFINEMENT_ITERATIONS = 15  # more iterations = better rules, slower learning

# --- Task definition ---
banking_task = Task(
    name="Banking Intent Classification",
    description=(
        f"Classify banking customer queries into one of these intents: {', '.join(CLASSES)}"
    ),
    input_schema={"text": "str"},
    output_schema={"label": "str"},
    type=TaskType.CLASSIFICATION,
    text_field="text",
)

# --- Load Banking77 ---
ds = load_dataset("legacy-datasets/banking77")
label_names = ds["train"].features["label"].names

train_records = [{"text": r["text"], "label": label_names[r["label"]]} for r in ds["train"]]
test_records = [{"text": r["text"], "label": label_names[r["label"]]} for r in ds["test"]]

# Split by class
by_label = defaultdict(list)
for ex in train_records:
    if ex["label"] in CLASSES:
        by_label[ex["label"]].append(ex)

train_data = [ex for label in sorted(CLASSES) for ex in by_label[label]]
test_data = [ex for ex in test_records if ex["label"] in CLASSES]

print(f"Banking77: {len(label_names)} total classes, using {len(CLASSES)}")
print(f"Training:  {len(train_data)} examples")
print(f"Test:      {len(test_data)} examples (held out)")
for label in sorted(CLASSES):
    print(f"  {label:30s} | {len(by_label[label]):3d} train")

In [None]:
# Peek at one training record and one held-out test record
print("\nExample train record:")
print(train_data[0])
print("Example test record:")
print(test_data[0])

In [None]:
storage2 = tempfile.mkdtemp(prefix="rulechef_banking77_")

chef2 = RuleChef(
    banking_task,
    client,
    dataset_name="banking77_5class",
    storage_path=storage2,
    model=MODEL,
    allowed_formats=[RuleFormat.REGEX],
    use_grex=True,  # regex hints from examples
    max_samples=100,  # max examples per LLM prompt (per-class + patches)
    max_rules_per_class=10,  # rules generated per class
    max_counter_examples=10,  # negative examples per class prompt
    coordinator=AgenticCoordinator(
        client, model=MODEL
    ),  # LLM-guided refinement + per-class synthesis
)

for ex in train_data:
    chef2.add_example({"text": ex["text"]}, {"label": ex["label"]})

print(f"Added {len(train_data)} examples, learning...")
rules, eval_result = chef2.learn_rules(max_refinement_iterations=MAX_REFINEMENT_ITERATIONS)

### Evaluate on held-out test data

The rules were learned from the training split. Let's see how they generalize to a fully held-out test set — examples the rules have never seen:

In [None]:
import time

from rulechef.evaluation import evaluate_dataset

# Build a held-out test Dataset
test_dataset = Dataset(name="banking77_test", task=banking_task)
for ex in test_data:
    test_dataset.examples.append(
        Example(
            id=str(uuid.uuid4())[:8],
            input={"text": ex["text"]},
            expected_output={"label": ex["label"]},
            source="benchmark",
        )
    )

# Evaluate on test set
t0 = time.perf_counter()  # monotonic, high-resolution timer suited to short benchmarks
test_eval = evaluate_dataset(rules, test_dataset, chef2.learner._apply_rules)
elapsed = time.perf_counter() - t0

print(
    f"Held-out test: {len(test_data)} examples, {len(rules)} rules, {elapsed / len(test_data) * 1000:.2f}ms/query"
)
show_eval(test_eval)
show_failures(test_eval)
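
For reference, per-class precision, recall, and F1 for a classifier can be computed from `(expected, predicted)` pairs as sketched below. This is a standalone illustration with hypothetical names (`per_class_metrics`, `label_pairs`). It is not RuleChef's evaluation code, and its conventions (for example, treating "no rule fired" as a miss) are assumptions.

```python
from collections import Counter


def per_class_metrics(label_pairs):
    """Compute precision, recall and F1 per label from (expected, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in label_pairs:
        if predicted == expected:
            tp[expected] += 1
        else:
            fn[expected] += 1  # the expected class was missed
            if predicted is not None:
                fp[predicted] += 1  # a different class was wrongly predicted
    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


label_pairs = [
    ("exchange_rate", "exchange_rate"),
    ("exchange_rate", None),            # no rule fired
    ("card_arrival", "exchange_rate"),  # wrong label
    ("card_arrival", "card_arrival"),
]
toy_metrics = per_class_metrics(label_pairs)
for label, m in sorted(toy_metrics.items()):
    print(f"{label:15s} P={m['precision']:.2f} R={m['recall']:.2f} F1={m['f1']:.2f}")
```

Looking at the numbers per class, not just in aggregate, shows which intents need more examples or corrections.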

### Persistence

Rules are automatically saved to disk as JSON. You can reload them by creating a new `RuleChef` with the same `dataset_name` and `storage_path`.

In [None]:
import json
from pathlib import Path

# Show the saved file
saved_file = Path(storage2) / "banking77_5class.json"
data = json.loads(saved_file.read_text())
print(f"Saved to: {saved_file}")
print(f"File size: {saved_file.stat().st_size / 1024:.1f} KB")
print(f"Contains: {len(data.get('rules', []))} rules, {len(data.get('examples', []))} examples")

In [None]:
# Recreate chef from disk — rules load automatically
chef3 = RuleChef(
    banking_task,
    client,
    dataset_name="banking77_5class",
    storage_path=storage2,
    model=MODEL,
    allowed_formats=[RuleFormat.REGEX],
)
print(f"Loaded {len(chef3.dataset.rules)} rules from disk")

# Works immediately
result = chef3.extract({"text": "what's the exchange rate for dollars?"})
print(f"Extract: {result}")

# Measure throughput
t0 = time.perf_counter()  # monotonic, high-resolution timer suited to short benchmarks
for ex in test_data[:100]:
    chef3.extract({"text": ex["text"]})
elapsed = time.perf_counter() - t0
print(f"\n100 queries in {elapsed * 1000:.1f}ms ({elapsed / 100 * 1000:.2f}ms per query)")
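
Why reloaded rules need no LLM can be seen in a toy round trip: once patterns are stored as data, loading them back and matching is plain regex work. The rule schema below (`pattern`, `label`) and the names `toy_rules`, `rules_path`, and `classify_query` are hypothetical. RuleChef's saved JSON is assumed to be richer and may be laid out differently.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical rule schema, for illustration only.
toy_rules = [
    {"pattern": r"\bexchange rates?\b", "label": "exchange_rate"},
    {"pattern": r"\bcard\b.*\barriv", "label": "card_arrival"},
]

# Save the rules as JSON, as if a learning run had just finished.
rules_path = Path(tempfile.mkdtemp(prefix="toy_rules_")) / "rules.json"
rules_path.write_text(json.dumps({"rules": toy_rules}, indent=2))

# Later (or in another process): reload and compile once.
reloaded_rules = json.loads(rules_path.read_text())["rules"]
compiled_rules = [
    (re.compile(rule["pattern"], re.IGNORECASE), rule["label"]) for rule in reloaded_rules
]


def classify_query(query):
    """Return the label of the first matching rule, or None if nothing matches."""
    for pattern, label in compiled_rules:
        if pattern.search(query):
            return label
    return None


print(classify_query("What's the exchange rate for dollars?"))
print(classify_query("Has my card arrived yet?"))
```

Compiling the patterns once at load time keeps per-query cost down to a handful of regex searches.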

## Summary

In this tutorial you learned how to use RuleChef on a medical NER task (extracting DRUG, DOSAGE, FREQUENCY, and CONDITION entities from prescription text), then benchmarked it on a classification task to show it works across task types.

| # | What you learned | Key API |
|---|-----------------|---------|
| 1 | Define a task — describe entity types and schema | `Task(...)` |
| 2 | Add examples — input text + labeled entity spans | `chef.add_example()` |
| 3 | Learn rules — the LLM writes regex rules from examples | `chef.learn_rules()` |
| 4 | Inspect rules — see patterns, entity types, priorities | `chef.get_rules_summary()`, `chef.dataset.rules` |
| 5 | Extract — run NER with sub-ms latency, no LLM needed | `chef.extract()` |
| 6 | Evaluate — precision/recall/F1 per entity type and per rule | `chef.evaluate()`, `chef.get_rule_metrics()` |
| 7 | Correct mistakes — add corrections and re-learn incrementally | `chef.add_correction()` |
| 8 | Give feedback — steer rule generation with natural language | `chef.add_feedback()` |
| 9 | Sampling strategies — control which examples the LLM sees | `sampling_strategy="diversity"` |
| 10 | Under the hood — inspect synthesis and patch prompts, grex hints | `learner._build_synthesis_prompt()` |
| 11 | Agentic learning — LLM-guided refinement per entity type | `AgenticCoordinator` |
| 12 | Rule pruning — merge redundant rules with safety net | `prune_after_learn=True` |
| 13 | LLM observability — wrap your client, auto-discover task, learn rules | `chef.start_observing()` |
| 14 | Scale up — from 3 examples to a real benchmark dataset | `Banking77` |
| 15 | Persistence — rules save to disk and reload automatically | `storage_path=` |

**Docs & further reading:**
- [How It Works](https://krlabsorg.github.io/rulechef/getting-started/concepts/) — architecture, buffer, rules, schemas, coordinators
- [Full documentation](https://krlabsorg.github.io/rulechef) — installation, guides, API reference
- [Task types](https://krlabsorg.github.io/rulechef/guide/task-types/) — NER, extraction, classification, transformation
- [Learning & refinement](https://krlabsorg.github.io/rulechef/guide/learning/) — buffer architecture, sampling, incremental patching
- [Evaluation & feedback](https://krlabsorg.github.io/rulechef/guide/evaluation/) — metrics, corrections, custom matchers
- [Coordinators](https://krlabsorg.github.io/rulechef/guide/coordinators/) — simple vs agentic, rule pruning
- [Advanced features](https://krlabsorg.github.io/rulechef/guide/advanced/) — observation mode, spaCy patterns, training data logger
- [GitHub](https://github.com/KRLabsOrg/rulechef)

**What to try next:**
- Add more examples from your own data
- Use corrections when the extractor misses entities
- Try `RuleFormat.CODE` for more complex matching logic (e.g., drug name lists)
- Try `RuleFormat.SPACY` for dependency-aware patterns
- Use the CLI for interactive exploration: `rulechef`