# Creating Custom Components

This tutorial shows two approaches to extending **autochecklist**:

### Easy: Custom Prompts from `.md` Files
Write a prompt template in a `.md` file and load it — no Python subclassing needed.
- **Custom generator prompts**: write your generation prompt with an `{input}` placeholder
- **Custom scorer prompts**: write your scoring prompt with `{input}`, `{target}`, and `{checklist}` placeholders

### Advanced: Subclass from Base Classes
For full control, create your own Python classes:
- **Generators** — Create checklists from inputs
- **Scorers** — Evaluate targets against checklists
- **Refiners** — Improve checklists (filter, deduplicate, etc.)

The library uses a **Template Method** pattern: base classes provide common infrastructure (LLM calls, metadata handling), and you only implement the domain-specific logic.
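
As a generic illustration of the Template Method pattern (not autochecklist's actual class layout), the sketch below uses a hypothetical `GeneratorSketch` base class whose `generate()` fixes the overall flow while subclasses supply `build_prompt()` and `parse()`. All names here are made up for the example, and `_call_model()` returns a canned string instead of calling an LLM. In autochecklist itself, as the later sections show, the pieces you implement are `generate()` and a name property, and the base class provides `_call_model()`.

```python
from abc import ABC, abstractmethod


class GeneratorSketch(ABC):
    """Hypothetical base class for illustration; not autochecklist's real one."""

    def __init__(self, model: str = "stub-model") -> None:
        self.model = model

    def generate(self, input: str) -> list[str]:
        # Template method: the skeleton is fixed here, subclasses fill in steps.
        prompt = self.build_prompt(input)
        raw = self._call_model(prompt)
        return self.parse(raw)

    def _call_model(self, prompt: str) -> str:
        # Shared infrastructure. A canned reply stands in for a real LLM call.
        return "Is the response on topic?\nIs the response concise?"

    @abstractmethod
    def build_prompt(self, input: str) -> str: ...

    @abstractmethod
    def parse(self, raw: str) -> list[str]: ...


class LineGenerator(GeneratorSketch):
    """Subclass that only supplies the domain-specific steps."""

    def build_prompt(self, input: str) -> str:
        return f"Write yes/no questions for: {input}"

    def parse(self, raw: str) -> list[str]:
        return [line.strip() for line in raw.splitlines() if line.strip()]


questions = LineGenerator().generate("Write a haiku about rain")
print(questions)
```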

## Setup

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

# OpenRouter is the default provider. Custom components automatically inherit
# multi-provider support (provider, base_url, client, api_format params)
# via the base class. See pipeline_demo.ipynb for provider examples.
assert os.getenv("OPENROUTER_API_KEY"), "Please set OPENROUTER_API_KEY in .env file"

MODEL = "openai/gpt-5-mini"

---
## 1. Custom Prompts from `.md` Files (Easy Way)

The simplest way to create a custom generator or scorer is to write a prompt template as a `.md` file. No Python subclassing needed.
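
To make the placeholder convention concrete, here is a minimal sketch of how a template containing `{input}`, `{target}`, and `{checklist}` might be filled in. `fill_prompt` is a hypothetical helper written for this example. How autochecklist substitutes placeholders internally is not shown in this tutorial, so treat the replacement strategy as an assumption.

```python
def fill_prompt(template: str, **values: str) -> str:
    """Substitute {name} placeholders using plain string replacement.

    Hypothetical helper; autochecklist's own loader may work differently.
    str.replace (rather than str.format) leaves any other braces in the
    prompt, such as a JSON example, untouched.
    """
    filled = template
    for name, value in values.items():
        filled = filled.replace("{" + name + "}", value)
    return filled


scorer_template = (
    "**Instruction**: {input}\n"
    "**Response to evaluate**: {target}\n"
    "**Evaluation criteria**:\n{checklist}"
)

filled_prompt = fill_prompt(
    scorer_template,
    input="Write a limerick about a cat",
    target="There once was a cat named Sue...",
    checklist="1. Is the response a limerick?",
)
print(filled_prompt)
```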

### Custom Generator Prompt

Write a `.md` file with an `{input}` placeholder. The LLM response is parsed automatically (numbered lists, bulleted lists, `[[bracketed]]` format all work).
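
For intuition about what "parsed automatically" can involve, the sketch below extracts questions from the three response formats mentioned above. It is not the library's parser: `parse_questions` and its regular expressions are assumptions for illustration, and the `[[bracketed]]` format is assumed to mean each question wrapped in double square brackets.

```python
import re

# Hypothetical patterns; the library's real parser may accept more variants.
_NUMBERED = re.compile(r"^\s*\d+[.)]\s+(.+)$")
_BULLETED = re.compile(r"^\s*[-*•]\s+(.+)$")
_BRACKETED = re.compile(r"\[\[(.+?)\]\]")


def parse_questions(response: str) -> list[str]:
    """Extract checklist questions from a raw LLM response."""
    # Bracketed items can appear anywhere in a line, so check them first.
    bracketed = _BRACKETED.findall(response)
    if bracketed:
        return [q.strip() for q in bracketed]

    questions = []
    for line in response.splitlines():
        match = _NUMBERED.match(line) or _BULLETED.match(line)
        if match:
            questions.append(match.group(1).strip())
    return questions


print(parse_questions("1. Is it a limerick?\n2) Is it about a cat?"))
print(parse_questions("- Is it clear?\n* Is it short?"))
print(parse_questions("[[Is it correct?]] [[Is it complete?]]"))
```

Lines that match none of the patterns, such as a preamble like "Here are the questions:", are skipped.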

In [2]:
# Let's look at our custom generator prompt
with open("my_generator.md") as f:
    print(f.read())

You are an evaluation expert. Generate a checklist of yes/no questions to evaluate an AI response.

**Instruction**: {input}

Create 3-5 concise yes/no questions that assess:
- Whether the response addresses the instruction
- Quality and correctness of the response
- Completeness of the response

Output each question on its own line as a numbered list:
1. Does the response ...?
2. Is the response ...?



In [3]:
# Use it directly with DirectGenerator
from pathlib import Path
from autochecklist import DirectGenerator

gen = DirectGenerator(custom_prompt=Path("my_generator.md"), model=MODEL)
checklist = gen.generate(input="Write a limerick about a cat")

print(f"Generated {len(checklist)} items:")
print(checklist.to_text())

Generated 4 items:
1. Does the response write a limerick specifically about a cat?
2. Does the poem follow limerick form (five lines with an AABBA rhyme scheme)?
3. Does the limerick maintain an appropriate meter/rhythm for a limerick?
4. Is the language clear, grammatical, and engaging?


In [4]:
# Or use it via pipeline() — even easier
# First register the custom generator, then use it by name
from autochecklist import pipeline, register_custom_generator

register_custom_generator("my_gen", "my_generator.md")
pipe = pipeline("my_gen", generator_model=MODEL, scorer_model=MODEL)

result = pipe(
    input="Write a limerick about a cat",
    target="There once was a cat named Sue\nWho loved to eat fish and stew\nShe'd purr all day\nIn a cozy way\nAnd nap under skies of blue",
)

print(f"Pass rate: {result.pass_rate:.0%}")
print(f"\nChecklist:")
print(result.checklist.to_text())

Pass rate: 100%

Checklist:
1. Does the response follow the instruction to write a limerick about a cat?
2. Is the response in limerick form (five lines with an AABBA rhyme scheme)?
3. Is the language clear and free of grammatical errors?
4. Is the limerick complete and coherent (five lines expressing a single idea)?


### Custom Scorer Prompt

`ChecklistScorer` already accepts a `custom_prompt` parameter. You can register a custom scorer prompt via `register_custom_scorer()` and then use it by name in the pipeline.

Your scorer prompt should use the `{input}`, `{target}`, and `{checklist}` placeholders and instruct the LLM to answer in the form `Q1: YES/NO, Q2: YES/NO, ...`.

`register_custom_scorer()` also accepts optional config kwargs:
- `mode` — `"batch"` (default) or `"item"`
- `primary_metric` — `"pass"` (default), `"weighted"`, or `"normalized"` (auto-enables logprobs)
- `capture_reasoning` — include per-item reasoning
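
The `Q1: YES/NO` output format matters because the answers have to be read back by code. The sketch below shows one way such a response could be turned into per-question results. `parse_batch_answers` is a hypothetical helper, not the library's parser, and it accepts answers either one per line or comma-separated.

```python
import re

# Matches "Q1: YES", "q2 : no", etc. Hypothetical pattern for illustration.
_ANSWER = re.compile(r"\bQ(\d+)\s*:\s*(YES|NO)\b", re.IGNORECASE)


def parse_batch_answers(response: str) -> dict[int, bool]:
    """Map question number to True (YES) or False (NO)."""
    return {
        int(number): answer.upper() == "YES"
        for number, answer in _ANSWER.findall(response)
    }


answers = parse_batch_answers("Q1: YES\nQ2: no\nQ3: YES")
print(answers)

# Share of questions answered YES.
yes_share = sum(answers.values()) / len(answers)
print(f"{yes_share:.0%}")
```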

In [5]:
# Look at our custom scorer prompt
with open("my_scorer.md") as f:
    print(f.read())

You are a strict but fair evaluator. Assess whether the response meets each criterion.

**Instruction**: {input}

**Response to evaluate**: {target}

**Evaluation criteria**:
{checklist}

For each question above, determine if the response satisfies the criterion.
Be strict — only answer YES if the criterion is clearly and fully met.

Output your answers in this exact format:
Q1: YES or NO
Q2: YES or NO
(continue for all questions)



In [6]:
# Use custom generator + custom scorer together via pipeline
# Register both custom prompts, then use them by name
from autochecklist import register_custom_scorer

register_custom_generator("my_eval", "my_generator.md")
register_custom_scorer("strict_eval", "my_scorer.md")

pipe = pipeline("my_eval", generator_model=MODEL, scorer="strict_eval")

result = pipe(
    input="Explain photosynthesis in one sentence",
    target="Photosynthesis is when plants use sunlight, water, and CO2 to make glucose and oxygen.",
)

print(f"Pass rate: {result.pass_rate:.0%}")
print(f"\nChecklist:")
print(result.checklist.to_text())
print(f"\nItem-level results:")
for item, score in zip(result.checklist.items, result.score.item_scores):
    print(f"  {score.answer.value.upper():3} - {item.question}")

Pass rate: 100%

Checklist:
1. 1. Does the response provide a single-sentence explanation of photosynthesis?
2. 2. Is the explanation factually accurate about converting light energy, carbon dioxide, and water into glucose and oxygen?
3. 3. Is the sentence clear and easily understandable?
4. 4. Does the sentence include the essential components or outcomes of photosynthesis (light energy, CO2, water, and the primary products) within the one-sentence constraint?

Item-level results:
  YES - 1. Does the response provide a single-sentence explanation of photosynthesis?
  YES - 2. Is the explanation factually accurate about converting light energy, carbon dioxide, and water into glucose and oxygen?
  YES - 3. Is the sentence clear and easily understandable?
  YES - 4. Does the sentence include the essential components or outcomes of photosynthesis (light energy, CO2, water, and the primary products) within the one-sentence constraint?


### Registering Custom Prompts for Reuse

Register a custom generator or scorer under a name so you can use it like any built-in method.

In [7]:
from autochecklist import register_custom_generator, register_custom_scorer, list_generators, list_scorers

# Register a custom generator under a name
register_custom_generator("my_eval", "my_generator.md")

# Register a custom scorer under a name
register_custom_scorer("strict_eval", "my_scorer.md")

# Now they appear in the registry
print("Generators:", list_generators())
print("Scorers:", list_scorers())

# Use by name, just like built-in methods
pipe = pipeline("my_eval", generator_model=MODEL, scorer="strict_eval")
result = pipe("Write a haiku about rain", target="Drops fall from the sky\nPuddles form on empty streets\nEarth drinks deeply now")
print(f"\nPass rate: {result.pass_rate:.0%}")

Generators: ['tick', 'rocketeval', 'rlcf_direct', 'rlcf_candidate', 'rlcf_candidates_only', 'feedback', 'checkeval', 'interacteval', 'my_gen', 'my_eval']
Scorers: ['batch', 'item', 'weighted', 'normalized', 'strict_eval']



Pass rate: 100%


### Scorer Config: Mode, Primary Metric, and More

When registering custom scorers or pipelines, you can configure the scorer's behavior — not just the prompt. This gives you the same flexibility as built-in presets like `rlcf_direct` or `rocketeval`.
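
To give a feel for why the choice of primary metric matters, the sketch below contrasts an unweighted pass rate with a weighted score on the same answers. The formulas and the per-item weights are assumptions made for this example; the tutorial does not spell out how autochecklist computes its `weighted` or `normalized` metrics, so check the library source before relying on them.

```python
def pass_rate(answers: list[bool]) -> float:
    """Fraction of items answered YES (assumed definition)."""
    return sum(answers) / len(answers) if answers else 0.0


def weighted_score(answers: list[bool], weights: list[float]) -> float:
    """Weight of the YES items divided by total weight (assumed definition)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for a, w in zip(answers, weights) if a) / total


# Hypothetical example: three items, the first one matters most.
example_answers = [True, True, False]
example_weights = [100.0, 50.0, 50.0]

print(f"pass:     {pass_rate(example_answers):.2f}")
print(f"weighted: {weighted_score(example_answers, example_weights):.2f}")
```

Under these assumed definitions the same three answers give roughly 0.67 for the pass rate and 0.75 for the weighted score, because the failed item carries a smaller share of the total weight.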

In [8]:
# register_custom_scorer with config kwargs
# By default, custom scorers use mode="batch". You can configure the full
# scorer behavior: mode, primary_metric, capture_reasoning.
# Setting primary_metric="normalized" auto-enables logprobs.

register_custom_scorer(
    "weighted_strict",
    "my_scorer.md",
    mode="item",              # evaluate one item per LLM call
    primary_metric="weighted",  # Score.primary_score uses weighted_score
)

print("Registered 'weighted_strict' scorer with mode=item, primary_metric=weighted")
print("Scorers:", list_scorers())

In [9]:
# register_custom_pipeline with full scorer config
# Instead of just scorer="weighted", specify the exact scorer behavior:
from autochecklist import register_custom_pipeline
from pathlib import Path

# Example 1: with a built-in scorer prompt name ("rlcf", "rocketeval")
register_custom_pipeline(
    "my_weighted_eval",
    generator_prompt="Generate yes/no evaluation questions for:\n\n{input}",
    scorer_mode="item",              # one item per LLM call
    primary_metric="weighted",       # Score.primary_score = weighted_score
    scorer_prompt="rlcf",            # built-in scorer prompt name
)

# Example 2: with a custom scorer prompt file
register_custom_pipeline(
    "my_custom_eval",
    generator_prompt="Generate yes/no evaluation questions for:\n\n{input}",
    scorer_mode="batch",
    scorer_prompt=Path("my_scorer.md"),  # reads file at registration time
    force=True,  # allow re-registration
)

# Use it like any built-in preset
pipe = pipeline("my_weighted_eval", generator_model=MODEL, scorer_model=MODEL)
print("Pipeline created with scorer mode:", pipe.scorer.mode)
print("Primary metric:", pipe.scorer.primary_metric)

Pipeline created with scorer mode: item
Primary metric: weighted


In [10]:
# Save and load pipeline configs — scorer settings are preserved
from autochecklist import save_pipeline_config, load_pipeline_config
import json as _json

save_pipeline_config("my_weighted_eval", "my_weighted_eval.json")

# Inspect the saved JSON — scorer config is stored as flat keys
with open("my_weighted_eval.json") as f:
    config = _json.load(f)
print("Saved config:")
print(_json.dumps(config, indent=2))

# Reload on another machine or in another session
loaded_name = load_pipeline_config("my_weighted_eval.json", force=True)
print(f"\nLoaded pipeline: '{loaded_name}'")

# Clean up
import os
os.remove("my_weighted_eval.json")

Saved config:
{
  "name": "my_weighted_eval",
  "generator_class": "direct",
  "generator_prompt": "Generate yes/no evaluation questions for:\n\n{input}",
  "scorer_mode": "item",
  "scorer_prompt": "rlcf",
  "primary_metric": "weighted",
  "use_logprobs": false,
  "capture_reasoning": false
}

Loaded pipeline: 'my_weighted_eval'


---
## 2. Custom Generator (Advanced — Subclassing)

For full control over generation logic, create a Python class:

1. Inherit from `InstanceChecklistGenerator` (one checklist per input) or `CorpusChecklistGenerator` (one checklist per dataset)
2. Implement the `method_name` property and the `generate()` method
3. Use `@register_generator("name")` to make it discoverable

**Available helpers:**
- `self._call_model(prompt)` - Call the LLM
- `self.model`, `self.temperature` - Configuration
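
If decorator-based registration is unfamiliar, the sketch below shows the general idea behind a decorator like `@register_generator("name")`: a decorator factory that stores the class in a dictionary under the given name. This is a generic pattern, not autochecklist's implementation, and every name in it (`_GENERATORS`, `register_generator_sketch`, `EchoGenerator`, and so on) is hypothetical.

```python
from typing import Callable, Dict, List

# Name -> class mapping. A stand-in for whatever the library keeps internally.
_GENERATORS: Dict[str, type] = {}


def register_generator_sketch(name: str) -> Callable[[type], type]:
    """Return a class decorator that records the class under `name`."""

    def decorator(cls: type) -> type:
        _GENERATORS[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


def get_generator_sketch(name: str) -> type:
    return _GENERATORS[name]


def list_generators_sketch() -> List[str]:
    return list(_GENERATORS)


@register_generator_sketch("echo")
class EchoGenerator:
    def generate(self, input: str) -> List[str]:
        return [f"Does the response address: {input}?"]


print(list_generators_sketch())
generator_cls = get_generator_sketch("echo")
print(generator_cls().generate("Write a haiku"))
```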

In [11]:
import re
from typing import Optional, Any

from autochecklist import (
    InstanceChecklistGenerator,
    Checklist,
    ChecklistItem,
    register_generator,
)


@register_generator("simple")
class SimpleGenerator(InstanceChecklistGenerator):
    """A minimal generator that creates 3 checklist items from an input."""

    @property
    def method_name(self) -> str:
        return "simple"

    def generate(
        self,
        input: str,
        target: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> Checklist:
        # Create the prompt
        prompt = f"""Given this input: \"{input}\"

Generate exactly 3 yes/no questions to evaluate a response.
Format each question on its own line, starting with a number:
1. <question>
2. <question>
3. <question>"""

        # Call the LLM (uses self.model, self.temperature automatically)
        raw_response = self._call_model(prompt)

        # Parse the response into ChecklistItems
        items = []
        for line in raw_response.strip().split("\n"):
            # Match lines like "1. Is the response..." or "1) Is the response..."
            match = re.match(r"^\d+[.)\s]+(.+)$", line.strip())
            if match:
                question = match.group(1).strip()
                items.append(ChecklistItem(question=question))

        # Return a Checklist
        return Checklist(
            items=items,
            source_method=self.method_name,
            generation_level=self.generation_level,  # "instance" from base class
            input=input,
            metadata={"raw_response": raw_response},
        )


print("SimpleGenerator registered!")

SimpleGenerator registered!


In [12]:
# Verify it appears in the registry
from autochecklist import list_generators

print("Available generators:", list_generators())
assert "simple" in list_generators()

Available generators: ['tick', 'rocketeval', 'rlcf_direct', 'rlcf_candidate', 'rlcf_candidates_only', 'feedback', 'checkeval', 'interacteval', 'my_gen', 'my_eval', 'my_weighted_eval', 'my_custom_eval', 'simple']


In [13]:
# Use it directly
generator = SimpleGenerator(model=MODEL)
checklist = generator.generate(input="Write a haiku about autumn")

print(f"Generated {len(checklist)} items:")
print(checklist.to_text())

Generated 3 items:
1. Is the response written as a three-line poem (haiku format)?
2. Does the poem evoke autumn through imagery, words, or sensory details?
3. Does the poem adhere to the traditional 5-7-5 syllable structure?


In [14]:
# Use it via pipeline() - the recommended way
from autochecklist import pipeline

pipe = pipeline("simple", generator_model=MODEL, scorer_model=MODEL)

result = pipe(
    input="Write a haiku about autumn",
    target="Leaves fall gently down\nCrisp air whispers through the trees\nNature's final bow",
)

print(f"Pass rate: {result.pass_rate:.0%}")

Pass rate: 100%


---
## 3. Custom Scorer (Advanced — Subclassing)

Scorers evaluate targets against checklists. To create one:

1. Inherit from `ChecklistScorer`
2. Implement the `scoring_method` property and the `score()` method
3. Use `@register_scorer("name")` to make it discoverable

**Available helpers:**
- `self._call_model(prompt)` - Call the LLM
- `score_batch()` - Default batch implementation (override for efficiency)
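
The "override for efficiency" hint can be illustrated without the library. In the sketch below, a hypothetical `ScorerSketch` scores targets one at a time by default, and a subclass overrides `score_batch()` to run the calls in a thread pool, which helps when each call waits on a network request. The class names and method signatures are assumptions for illustration; the real `score_batch()` signature is not shown in this tutorial.

```python
from concurrent.futures import ThreadPoolExecutor


class ScorerSketch:
    """Hypothetical stand-in; not autochecklist's ChecklistScorer."""

    def score(self, target: str) -> float:
        # Placeholder for an LLM-backed judgment: non-empty targets pass.
        return 1.0 if target.strip() else 0.0

    def score_batch(self, targets: list[str]) -> list[float]:
        # Default behaviour: score targets one after another.
        return [self.score(t) for t in targets]


class ParallelScorerSketch(ScorerSketch):
    def score_batch(self, targets: list[str], max_workers: int = 4) -> list[float]:
        # Override: overlap the (normally network-bound) calls with threads.
        # Executor.map yields results in the same order as the inputs.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(self.score, targets))


targets = ["A haiku about autumn", "", "Another response"]
sequential = ScorerSketch().score_batch(targets)
parallel = ParallelScorerSketch().score_batch(targets)
print(sequential, parallel)
```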

In [15]:
from autochecklist import (
    ChecklistScorer,
    Score,
    ItemScore,
    ChecklistItemAnswer,
    register_scorer,
)


@register_scorer("strict")
class StrictScorer(ChecklistScorer):
    """A scorer that evaluates each item and computes pass rate.
    
    Uses a simple prompt to determine YES/NO for each checklist item.
    """

    @property
    def scoring_method(self) -> str:
        return "strict"

    def score(
        self,
        checklist: Checklist,
        target: str,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> Score:
        item_scores = []

        for item in checklist.items:
            # Create evaluation prompt
            prompt = f"""Evaluate whether this response satisfies the criterion.

Input: {input or checklist.input or 'N/A'}

Response: {target}

Criterion: {item.question}

Answer with only YES or NO."""

            # Get LLM judgment
            raw_answer = self._call_model(prompt).strip().upper()
            
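            # Parse answer using whole-word matching, so text such as "UNKNOWN"
            # is not misread as NO (`re` was imported in the generator cell above)
            verdict = re.search(r"\b(YES|NO)\b", raw_answer)
            if verdict and verdict.group(1) == "YES":
                answer = ChecklistItemAnswer.YES
            elif verdict:
                answer = ChecklistItemAnswer.NO
            else:
                answer = ChecklistItemAnswer.NA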

            item_scores.append(ItemScore(item_id=item.id, answer=answer))

        # Calculate aggregate score
        yes_count = sum(1 for s in item_scores if s.answer == ChecklistItemAnswer.YES)
        total = len(item_scores)
        pass_rate = yes_count / total if total > 0 else 0.0

        return Score(
            checklist_id=checklist.id,
            item_scores=item_scores,
            total_score=pass_rate,
            judge_model=self.model,
            scoring_method=self.scoring_method,
        )


print("StrictScorer registered!")

StrictScorer registered!


In [16]:
# Verify it appears in the registry
from autochecklist import list_scorers

print("Available scorers:", list_scorers())
assert "strict" in list_scorers()

Available scorers: ['batch', 'item', 'weighted', 'normalized', 'strict_eval', 'weighted_strict', 'strict']


In [17]:
# Use it via pipeline
pipe = pipeline("simple", generator_model=MODEL, scorer="strict")

result = pipe(
    input="Write a haiku about autumn",
    target="Leaves fall gently down\nCrisp air whispers through the trees\nNature's final bow",
)

print(f"Pass rate (strict scorer): {result.pass_rate:.0%}")
print("\nItem-level results:")
for item, score in zip(result.checklist.items, result.score.item_scores):
    print(f"  {score.answer.value.upper():3} - {item.question}")

Pass rate (strict scorer): 100%

Item-level results:
  YES - Is the poem written in three lines (typical haiku form)?
  YES - Does the poem follow the 5-7-5 syllable pattern?
  YES - Does the poem clearly evoke autumn through words or imagery?


---
## 4. Custom Refiner (Advanced — Subclassing)

Refiners improve checklists by filtering, deduplicating, or transforming items. To create one:

1. Inherit from `ChecklistRefiner`
2. Implement the `refiner_name` property and the `refine()` method
3. Use `@register_refiner("name")` to make it discoverable

**Available helpers:**
- `self._call_model(prompt, response_format=...)` - Call LLM with optional JSON schema
- `self._create_refined_checklist(original, items, metadata_updates)` - Preserve metadata

In [18]:
from autochecklist import ChecklistRefiner, register_refiner


@register_refiner("length_filter")
class LengthFilter(ChecklistRefiner):
    """Filter out checklist items that are too short or too long.
    
    This is a simple refiner that doesn't use LLM calls.
    """

    @property
    def refiner_name(self) -> str:
        return "length_filter"

    def refine(
        self,
        checklist: Checklist,
        min_length: int = 20,
        max_length: int = 200,
        **kwargs: Any,
    ) -> Checklist:
        # Filter items by question length
        filtered_items = [
            item
            for item in checklist.items
            if min_length <= len(item.question) <= max_length
        ]

        # Track what was filtered
        removed = len(checklist.items) - len(filtered_items)
        metadata_updates = {
            "length_filter_removed": removed,
            "min_length": min_length,
            "max_length": max_length,
        }

        # Use helper to preserve original metadata
        return self._create_refined_checklist(
            original=checklist,
            items=filtered_items,
            metadata_updates=metadata_updates,
        )


print("LengthFilter registered!")

LengthFilter registered!


In [19]:
# Verify it appears in the registry
from autochecklist import list_refiners

print("Available refiners:", list_refiners())
assert "length_filter" in list_refiners()

Available refiners: ['deduplicator', 'tagger', 'unit_tester', 'selector', 'length_filter']


In [20]:
# Use it directly
refiner = LengthFilter()

# Create a test checklist with varying question lengths
test_checklist = Checklist(
    items=[
        ChecklistItem(question="Short?"),  # Too short (6 chars)
        ChecklistItem(question="Is this a good length question?"),  # OK (31 chars)
        ChecklistItem(question="Does the response adequately address the topic?"),  # OK (47 chars)
        ChecklistItem(question="X" * 250),  # Too long (250 chars)
    ],
    source_method="test",
    generation_level="instance",
)

print(f"Before: {len(test_checklist)} items")
for item in test_checklist.items:
    print(f"  [{len(item.question):3} chars] {item.question[:50]}...")

refined = refiner.refine(test_checklist, min_length=20, max_length=200)

print(f"\nAfter: {len(refined)} items")
for item in refined.items:
    print(f"  [{len(item.question):3} chars] {item.question[:50]}")

print(f"\nMetadata: {refined.metadata}")

Before: 4 items
  [  6 chars] Short?...
  [ 31 chars] Is this a good length question?...
  [ 47 chars] Does the response adequately address the topic?...
  [250 chars] XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

After: 2 items
  [ 31 chars] Is this a good length question?
  [ 47 chars] Does the response adequately address the topic?

Metadata: {'refined_by': 'length_filter', 'original_count': 4, 'length_filter_removed': 2, 'min_length': 20, 'max_length': 200}
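
The same filter-style structure carries over to other simple refinements. As one example, the sketch below removes questions that are identical after normalizing case, whitespace, and trailing punctuation. It works on plain strings and is not the implementation of the built-in `deduplicator` refiner; `deduplicate_questions` is a hypothetical helper whose logic could sit inside a `refine()` method like the one above.

```python
def deduplicate_questions(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each question, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for question in questions:
        # Normalize: lowercase, collapse whitespace, drop trailing punctuation.
        key = " ".join(question.lower().split()).rstrip("?.! ")
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


sample = ["Is it clear?", "is it  clear", "Is it short?"]
deduplicated = deduplicate_questions(sample)
print(deduplicated)
```

Exact-match deduplication like this will not catch paraphrases, which is where an LLM-backed or embedding-based refiner becomes useful.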


---
## 5. Putting It All Together

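Once registered, custom components can be used by name through `pipeline()` or fetched from the registry and instantiated directly. The two methods below generate their checklists in separate LLM calls, so the questions, and therefore the pass rates, can differ between them.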

In [21]:
from autochecklist import pipeline, ChecklistPipeline, get_generator, get_scorer, get_refiner

# Method 1: Use pipeline() with registered names
pipe = pipeline("simple", generator_model=MODEL, scorer="strict")
result = pipe("Write a poem", target="Roses are red, violets are blue.")
print(f"Pipeline result: {result.pass_rate:.0%}")

# Method 2: Get classes from registry and instantiate manually
gen_cls = get_generator("simple")
scorer_cls = get_scorer("strict")

generator = gen_cls(model=MODEL)
scorer = scorer_cls(model=MODEL)

checklist = generator.generate(input="Write a poem")
score = scorer.score(checklist, target="Roses are red, violets are blue.")
print(f"Manual result: {score.pass_rate:.0%}")

Pipeline result: 100%


Manual result: 33%


---
## Quick Reference

### Easy: `.md` File Approach

| What | How | Example |
|------|-----|--------|
| Custom generator | `register_custom_generator("name", "file.md")` then `pipeline("name", generator_model=MODEL)` | `.md` with `{input}` |
| Custom scorer | `register_custom_scorer("name", "file.md")` then `pipeline("tick", scorer="name")` | `.md` with `{input}`, `{target}`, `{checklist}` |
| Custom scorer (configured) | `register_custom_scorer("name", "file.md", mode="item", primary_metric="weighted")` | Supports `mode`, `primary_metric`, `capture_reasoning` |
| Custom pipeline | `register_custom_pipeline("name", generator_prompt="...", scorer_mode="item", primary_metric="weighted")` | Full scorer config via flat kwargs |
| Both custom | Register both, then `pipeline("gen_name", scorer="scorer_name")` | |

### Advanced: Subclassing

| Component | Base Class | Required Methods | Decorator |
|-----------|------------|------------------|----------|
| Generator (instance) | `InstanceChecklistGenerator` | `method_name`, `generate(input, ...)` | `@register_generator("name")` |
| Generator (corpus) | `CorpusChecklistGenerator` | `method_name`, `generate(inputs, ...)` | `@register_generator("name")` |
| Scorer | `ChecklistScorer` | `scoring_method`, `score(checklist, target, ...)` | `@register_scorer("name")` |
| Refiner | `ChecklistRefiner` | `refiner_name`, `refine(checklist, ...)` | `@register_refiner("name")` |

**Return types:**
- Generators return `Checklist` (with `ChecklistItem` objects)
- Scorers return `Score` (with `ItemScore` objects and aggregates)
- Refiners return `Checklist` (refined version)

**For more complex examples**, see the built-in implementations:
- Generators: `autochecklist/generators/template.py`
- Scorers: `autochecklist/scorers/base.py`
- Refiners: `autochecklist/refiners/deduplicator.py`