# AutoChecklist Pipeline Demo

This notebook demonstrates the **Pipeline API** for easy checklist generation and scoring.

The pipeline API provides:
- **One-liner evaluation**: `pipeline("tick")(input="...", target="...")`
- **Custom prompts**: Load generator/scorer prompts from `.md` files
- **Component registry**: Discover and instantiate components by name
- **Batch evaluation**: Score entire datasets with progress bars
- **Config-driven usage**: Create pipelines from configuration dicts

## Setup

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# OpenRouter is the default provider. Set OPENROUTER_API_KEY in .env
# For other providers: OPENAI_API_KEY (OpenAI direct), or use vLLM (no key needed)
assert os.getenv("OPENROUTER_API_KEY"), "Please set OPENROUTER_API_KEY in .env file"

MODEL = "openai/gpt-5-mini"

---
## 1. Component Registry

The registry allows you to discover available components without knowing their class names.

In [2]:
from autochecklist import list_generators, list_scorers, list_refiners

print("Available generators:", list_generators())
print("Available scorers:", list_scorers())
print("Available refiners:", list_refiners())

Available generators: ['tick', 'rocketeval', 'rlcf_direct', 'rlcf_candidate', 'rlcf_candidates_only', 'feedback', 'checkeval', 'interacteval']
Available scorers: ['batch', 'item', 'weighted', 'normalized']
Available refiners: ['deduplicator', 'tagger', 'unit_tester', 'selector']
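
A name-based registry like this is often built as a dictionary that a class decorator populates. The block below is a minimal sketch of that general pattern, not autochecklist's actual implementation. `register_generator`, `list_registered`, `get_registered`, and `EchoGenerator` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a short name to a component class.
_GENERATORS: Dict[str, type] = {}


def register_generator(name: str) -> Callable[[type], type]:
    """Class decorator that records a class under `name`."""
    def decorator(cls: type) -> type:
        if name in _GENERATORS:
            raise ValueError(f"generator {name!r} is already registered")
        _GENERATORS[name] = cls
        return cls
    return decorator


def list_registered() -> List[str]:
    """Return registered names in insertion order."""
    return list(_GENERATORS)


def get_registered(name: str) -> type:
    """Look up a class by name, with a readable error for unknown names."""
    try:
        return _GENERATORS[name]
    except KeyError:
        raise KeyError(
            f"unknown generator {name!r}; available: {list_registered()}"
        ) from None


@register_generator("echo")
class EchoGenerator:
    """Toy component: turns the input into a single yes/no question."""

    def generate(self, input: str) -> List[str]:
        return [f"Does the response address: {input}?"]


# Callers only need the string name, never the class name.
generator = get_registered("echo")()
print(list_registered())
print(generator.generate("Say hello"))
```

Because lookups go through a string key, a UI dropdown or config file can select a component without importing its class.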


In [3]:
# Get detailed info for UI dropdowns
from autochecklist.registry import list_generators_with_info

for info in list_generators_with_info():
    ref = "  [needs ref]" if info.get("requires_reference") else ""
    scorer = info.get("default_scorer") or "—"
    print(f"  {info['name']:22} [{info['level']:8}] {info['description']}")
    if info.get("detail"):
        print(f"  {'':22}  └─ {info['detail']}")
    print(f"  {'':22}  scorer: {scorer}{ref}")
    print()

  tick                   [instance] Few-shot checklist generation from input text
                          └─ Uses a TICK-style prompt to generate yes/no checklist items from the input alone. No reference needed.
                          scorer: batch

  rocketeval             [instance] Confidence-aware checklist from input + reference
                          └─ Generates checklist items informed by a reference response, scored via logprobs for calibrated confidence levels.
                          scorer: normalized  [needs ref]

  rlcf_direct            [instance] Weighted criteria from input + reference
                          └─ Generates weighted checklist items (0-100 importance) by comparing the input against a reference response.
                          scorer: weighted  [needs ref]

  rlcf_candidate         [instance] Contrastive generation with candidates + reference
                          └─ Auto-generates candidate responses, then creates weighted checklist ite

---
## 2. Simple Pipeline - One-liner Evaluation

The `pipeline()` function creates a ready-to-use pipeline (like HuggingFace's `transformers.pipeline`).

In [4]:
from autochecklist import pipeline

# Create a TICK pipeline
pipe = pipeline("tick", generator_model=MODEL, scorer_model=MODEL)

# Evaluate a response in one line
result = pipe(
    input="Write a haiku about autumn",
    target="Leaves fall gently down\nCrisp air whispers through the trees\nNature's final bow"
)

print(f"Checklist ({len(result.checklist)} items):")
print(result.checklist.to_text())
print(f"\nPass rate: {result.pass_rate:.0%}")

Checklist (4 items):
1. Is the response written as a three-line poem?
2. Does the poem follow a 5-7-5 syllable structure across the three lines?
3. Is the poem explicitly about autumn (mentions autumn or clearly references autumnal elements such as falling leaves, chill, harvest, migration)?
4. Does the poem use concrete sensory imagery (sight, sound, smell, touch, or taste) to evoke the season?

Pass rate: 100%


In [5]:
# Generate-only mode (no scoring)
result = pipe(input="Write a limerick")

print("Generated checklist (no scoring):")
print(result.checklist.to_text())
print(f"\nScore: {result.score}")  # None since no target was provided

Generated checklist (no scoring):
1. Is the response written as a poem?
2. Does the poem consist of five lines?
3. Do lines 1, 2, and 5 rhyme with each other and lines 3 and 4 rhyme with each other (AABBA rhyme scheme)?
4. Does the poem exhibit a limerick-like rhythm (longer meter in lines 1, 2, and 5 and shorter meter in lines 3 and 4)?
5. Does the poem include a humorous, whimsical, or surprising twist typical of limericks?

Score: None


---
## 3. Pipeline with Custom Components

For more control, use `ChecklistPipeline` directly with custom components.

You can also load custom prompts from `.md` files by registering them first:
```python
from autochecklist import register_custom_generator, register_custom_scorer

# Custom generator prompt
register_custom_generator("my_eval", "my_gen.md")
pipe = pipeline("my_eval", generator_model=MODEL, scorer_model=MODEL)

# Custom scorer prompt (works with any generator)
register_custom_scorer("my_scorer", "my_scorer.md")
pipe = pipeline("tick", generator_model=MODEL, scorer="my_scorer")

# Custom scorer with config kwargs (mode, primary_metric, etc.)
register_custom_scorer("my_weighted", "my_scorer.md", mode="item", primary_metric="weighted")
pipe = pipeline("tick", generator_model=MODEL, scorer="my_weighted")
```

Or register a full custom pipeline with scorer config in one call:
```python
from autochecklist import register_custom_pipeline

register_custom_pipeline(
    "my_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer_mode="item",
    primary_metric="weighted",
    scorer_prompt="rlcf",  # built-in name, inline text, or Path to file
)
pipe = pipeline("my_eval", generator_model=MODEL, scorer_model=MODEL)
```

See `custom_components_tutorial.ipynb` for details and full examples.

In [6]:
from autochecklist import ChecklistPipeline, DirectGenerator, ChecklistScorer

# Create pipeline with explicit components
pipe = ChecklistPipeline(
    generator=DirectGenerator(method_name="rlcf_direct", model=MODEL),
    scorer=ChecklistScorer(mode="item", primary_metric="weighted", model=MODEL),
)

# RLCF direct requires a reference response
input_text = "Write a factorial function in Python"
reference = """def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n <= 1 else n * factorial(n-1)"""

result = pipe(
    input=input_text,
    reference=reference,
    target="def fact(x): return x * fact(x-1) if x > 1 else 1"
)

print(f"Weighted score: {result.score.weighted_score:.0%}")
print(f"\nChecklist items with weights:")
for item in result.checklist.items:
    print(f"  [{item.weight:3.0f}] {item.question}")

Weighted score: 29%

Checklist items with weights:
  [100] Is the provided code valid Python syntax (i.e., can be parsed without syntax errors)?
  [100] Does the code define a callable named 'factorial'?
  [ 80] Does the 'factorial' function accept exactly one parameter?
  [100] Does calling factorial(0) return 1?
  [ 90] Does calling factorial(1) return 1?
  [ 95] Does calling factorial(5) return 120?
  [ 60] For a negative integer input (e.g., -1), does the function raise an error or otherwise explicitly refuse to compute a factorial rather than returning an incorrect numeric value?
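
To make the weighted score concrete, here is a minimal sketch that assumes it is the sum of the weights of passed items divided by the sum of all weights. That definition is an assumption about the scorer, not something this notebook confirms. The weights are copied from the checklist above, and the pass/fail answers are hypothetical because the per-item answers are not printed.

```python
from typing import List, Tuple


def weighted_score(items: List[Tuple[float, bool]]) -> float:
    """Sum of weights of passed items divided by the sum of all weights."""
    total = sum(weight for weight, _ in items)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, ok in items if ok)
    return passed / total


# (weight, passed) pairs. Weights come from the checklist above;
# the booleans are hypothetical answers chosen for illustration.
items = [
    (100, True),   # valid Python syntax
    (100, False),  # defines a callable named 'factorial'
    (80, True),    # accepts exactly one parameter
    (100, False),  # factorial(0) returns 1
    (90, False),   # factorial(1) returns 1
    (95, False),   # factorial(5) returns 120
    (60, False),   # rejects negative input
]

score = weighted_score(items)
print(f"Weighted score: {score:.0%}")
```

With these hypothetical answers the sketch gives 180/625, about 29%, which is consistent with the reported score but does not prove which items actually passed.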


---
## 4. Batch Corpus Evaluation

Evaluate entire datasets efficiently with `run_batch()`.

In [7]:
# Prepare a small dataset
data = [
    {
        "input": "Write a haiku about spring",
        "target": "Cherry blossoms bloom\nSoft petals drift on the breeze\nNew life awakens"
    },
    {
        "input": "Write a haiku about summer",
        "target": "Golden sun beats down\nCicadas sing in the heat\nLazy days drift by"
    },
    {
        "input": "Write a haiku about winter", 
        "target": "I like winter because it snows and I can make snowmen."  # Not a haiku!
    },
]

pipe = pipeline("tick", generator_model=MODEL, scorer_model=MODEL)

# Run batch evaluation (generates one checklist per instruction)
result = pipe.run_batch(data, show_progress=True)

print(f"\nResults:")
print(f"  Macro-average pass rate: {result.macro_pass_rate:.0%}")
print(f"  Micro-average pass rate (i.e., DRFR): {result.micro_pass_rate:.0%}")
print(f"\nPer-item scores:")
for i, (item, score) in enumerate(zip(data, result.scores)):
    print(f"  {i+1}. {item['input'][:30]:30} -> {score.pass_rate:.0%}")

Generating: 100%|██████████| 3/3 [01:29<00:00, 29.71s/it]
Scoring: 100%|██████████| 3/3 [01:29<00:00, 29.71s/it]


Results:
  Macro-average pass rate: 75%
  Micro-average pass rate (i.e., DRFR): 77%

Per-item scores:
  1. Write a haiku about spring     -> 100%
  2. Write a haiku about summer     -> 100%
  3. Write a haiku about winter     -> 25%
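
The macro and micro averages can differ when checklists have different lengths. The sketch below shows one common way to compute both. Whether `run_batch()` computes them exactly this way is an assumption, and the answer lists (including the checklist lengths) are hypothetical values chosen for illustration.

```python
from typing import List


def macro_pass_rate(answers: List[List[bool]]) -> float:
    """Average of per-example pass rates (each example counts equally)."""
    rates = [sum(a) / len(a) for a in answers if a]
    return sum(rates) / len(rates) if rates else 0.0


def micro_pass_rate(answers: List[List[bool]]) -> float:
    """Passed items over all items (each checklist item counts equally)."""
    total = sum(len(a) for a in answers)
    return sum(sum(a) for a in answers) / total if total else 0.0


# Hypothetical yes/no answers, one inner list per example.
answers = [
    [True, True, True, True],         # spring: 4 of 4 pass
    [True, True, True, True, True],   # summer: 5 of 5 pass
    [True, False, False, False],      # winter: 1 of 4 pass
]

macro = macro_pass_rate(answers)
micro = micro_pass_rate(answers)
print(f"Macro: {macro:.0%}")
print(f"Micro: {micro:.0%}")
```

Here the macro average is (1 + 1 + 0.25) / 3 = 75%, while the micro average is 10 / 13, about 77%, because the longer all-pass checklist contributes more items.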





In [8]:
# Use a shared checklist for all evaluations
# Useful when you want to apply the same criteria to all responses

# First generate a shared checklist
shared_checklist = pipe.generate(input="Write a haiku about any season")
print("Shared checklist:")
print(shared_checklist.to_text())

# Then evaluate all responses against it
result = pipe.run_batch(data, checklist=shared_checklist)

print(f"\nWith shared checklist:")
print(f"  Macro-average pass rate: {result.macro_pass_rate:.0%}")

Shared checklist:
1. Is the response written in haiku form (three short lines)?
2. Does the haiku adhere to a 5-7-5 syllable structure?
3. Is the haiku explicitly about a season (mentions a season or uses clear seasonal imagery)?
4. Does the haiku use concise, sensory imagery rather than exposition or explanation?



With shared checklist:
  Macro-average pass rate: 75%


---
## 5. Export Results

In [9]:
# Export to JSONL
result.to_jsonl("evaluation_results.jsonl")
print("Saved to evaluation_results.jsonl")

# View the file
!head -2 evaluation_results.jsonl

Saved to evaluation_results.jsonl


{"input": "Write a haiku about spring", "target": "Cherry blossoms bloom\nSoft petals drift on the breeze\nNew life awakens", "pass_rate": 1.0, "item_scores": [{"item_id": "73e81fe4", "answer": "yes", "reasoning": null}, {"item_id": "14f4e2b8", "answer": "yes", "reasoning": null}, {"item_id": "b06fde4e", "answer": "yes", "reasoning": null}, {"item_id": "1f817be1", "answer": "yes", "reasoning": null}], "weighted_score": 1.0, "normalized_score": 1.0}
{"input": "Write a haiku about summer", "target": "Golden sun beats down\nCicadas sing in the heat\nLazy days drift by", "pass_rate": 1.0, "item_scores": [{"item_id": "73e81fe4", "answer": "yes", "reasoning": null}, {"item_id": "14f4e2b8", "answer": "yes", "reasoning": null}, {"item_id": "b06fde4e", "answer": "yes", "reasoning": null}, {"item_id": "1f817be1", "answer": "yes", "reasoning": null}], "weighted_score": 1.0, "normalized_score": 1.0}
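
The exported file can be read back with the standard library alone. The sketch below assumes only the field names visible in the output above (`input`, `pass_rate`, `item_scores`, `item_id`, `answer`). The sample records, the `"no"` answer value, and the helper names `load_jsonl` and `failed_item_ids` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: Path) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def failed_item_ids(record: Dict) -> List[str]:
    """IDs of checklist items whose answer is anything other than 'yes'."""
    return [s["item_id"] for s in record["item_scores"] if s["answer"] != "yes"]


# Hypothetical records shaped like the export above (shortened).
sample = [
    {"input": "a", "pass_rate": 1.0,
     "item_scores": [{"item_id": "x1", "answer": "yes", "reasoning": None}]},
    {"input": "b", "pass_rate": 0.0,
     "item_scores": [{"item_id": "x1", "answer": "no", "reasoning": None}]},
]

# Write the sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_results.jsonl"
    path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    records = load_jsonl(path)

# Inputs that did not pass every checklist item.
low = [r["input"] for r in records if r["pass_rate"] < 1.0]
print(low)
print(failed_item_ids(records[1]))
```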


In [10]:
# Export to pandas DataFrame (requires pandas)
try:
    df = result.to_dataframe()
    display(df[["input", "target", "pass_rate"]])
except ImportError:
    print("Install pandas for DataFrame export: pip install pandas")

|   | input | target | pass_rate |
|---|-------|--------|-----------|
| 0 | Write a haiku about spring | Cherry blossoms bloom\nSoft petals drift on th... | 1.0 |
| 1 | Write a haiku about summer | Golden sun beats down\nCicadas sing in the hea... | 1.0 |
| 2 | Write a haiku about winter | I like winter because it snows and I can make ... | 0.25 |


In [11]:
# Save and load checklists (Pydantic models)
from autochecklist import Checklist

# Save checklist to JSON file
with open("checklist.json", "w") as f:
    f.write(shared_checklist.model_dump_json(indent=2))

print("Saved checklist to checklist.json")

# Load checklist from JSON file
with open("checklist.json") as f:
    loaded_checklist = Checklist.model_validate_json(f.read())

print(f"Loaded checklist with {len(loaded_checklist)} items:")
print(loaded_checklist.to_text())

Saved checklist to checklist.json
Loaded checklist with 4 items:
1. Is the response written in haiku form (three short lines)?
2. Does the haiku adhere to a 5-7-5 syllable structure?
3. Is the haiku explicitly about a season (mentions a season or uses clear seasonal imagery)?
4. Does the haiku use concise, sensory imagery rather than exposition or explanation?


---
## 6. Config-Driven Instantiation

Create pipelines from configuration dicts (useful for UI integration or config files).

In [12]:
from autochecklist import get_generator, get_scorer

# Configuration dict (could come from a config file or UI)
config = {
    "generator": "tick",
    "scorer": "batch",  # or "item" for one-question-per-call
    "model": MODEL,
}

# Create components from config
gen_cls = get_generator(config["generator"])
scorer_cls = get_scorer(config["scorer"])

generator = gen_cls(model=config["model"])
scorer = scorer_cls(model=config["model"])

# Use them
checklist = generator.generate(input="Say hello")
score = scorer.score(checklist, target="Hello there!")

print(f"Generator: {generator.method_name}")
print(f"Scorer: {scorer.scoring_method}")
print(f"Pass rate: {score.pass_rate:.0%}")


Generator: tick
Scorer: batch
Pass rate: 100%
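
When the config comes from a file or a UI, it can help to validate it before any component is instantiated. The following is a minimal sketch under that assumption. `parse_config` is a hypothetical helper, and the allowed names are hardcoded from the registry listing earlier in this notebook so the block stays self-contained.

```python
import json
from typing import Dict

# Names copied from the registry listing earlier in this notebook.
KNOWN_GENERATORS = {
    "tick", "rocketeval", "rlcf_direct", "rlcf_candidate",
    "rlcf_candidates_only", "feedback", "checkeval", "interacteval",
}
KNOWN_SCORERS = {"batch", "item", "weighted", "normalized"}


def parse_config(text: str) -> Dict[str, str]:
    """Parse a JSON config and fail early on missing or unknown values."""
    config = json.loads(text)
    missing = {"generator", "scorer", "model"} - config.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if config["generator"] not in KNOWN_GENERATORS:
        raise ValueError(f"unknown generator: {config['generator']!r}")
    if config["scorer"] not in KNOWN_SCORERS:
        raise ValueError(f"unknown scorer: {config['scorer']!r}")
    return config


config = parse_config(
    '{"generator": "tick", "scorer": "batch", "model": "openai/gpt-5-mini"}'
)
print(config["generator"], config["scorer"])
```

In a real notebook the two sets would more naturally come from `list_generators()` and `list_scorers()`, so that custom registrations are also accepted.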


---
## Summary

| Feature | Description |
|---------|-------------|
| `pipeline("tick")` | Quick pipeline creation from a generator name (HuggingFace-style) |
| `register_custom_generator("name", "file.md")` | Register a custom `.md` prompt for generation |
| `register_custom_scorer("name", "file.md", mode=..., primary_metric=...)` | Register a custom scorer with config |
| `register_custom_pipeline("name", generator_prompt=..., scorer_mode=..., primary_metric=...)` | Register a full custom pipeline preset |
| `save_pipeline_config("name", "file.json")` / `load_pipeline_config("file.json")` | Save/load pipeline configs as JSON |
| `pipeline("name", scorer="scorer_name")` | Use registered custom generator + scorer |
| `ChecklistPipeline` | Full control with custom components |
| `list_generators()` | Discover available generators |
| `get_generator(name)` | Get generator class by name |
| `run_batch(data)` | Batch evaluation with metrics |
| `to_dataframe()` | Export results to pandas |
| `to_jsonl(path)` | Export results to JSONL |
| `checklist.model_dump_json()` | Save checklist to JSON |
| `Checklist.model_validate_json()` | Load checklist from JSON |

See `custom_components_tutorial.ipynb` for a full walkthrough of custom `.md` prompts, registration, and advanced subclassing.

---
## 7. Multi-Provider Support

The pipeline API supports multiple LLM providers. All components accept `provider`, `base_url`, `client`, and `api_format` parameters.

In [13]:
from autochecklist import pipeline, VLLMOfflineClient

# OpenRouter (default)
pipe = pipeline("tick", generator_model="openai/gpt-4o-mini", scorer_model="openai/gpt-4o-mini")

# OpenAI direct — lower latency, better caching
pipe = pipeline("tick", provider="openai", generator_model="gpt-4o-mini")

# vLLM server mode — self-hosted, OpenAI-compatible
pipe = pipeline("tick", provider="vllm", base_url="http://gpu-server:8000/v1")

# vLLM offline — direct Python inference, no server needed
# client = VLLMOfflineClient(model="google/gemma-3-1b-it")
# pipe = pipeline("tick", client=client)

# OpenAI Responses API — opt-in, better caching on OpenAI's side
pipe = pipeline("tick", provider="openai", generator_model="gpt-4o-mini", api_format="responses")

# Provider params propagate to all sub-components (generator + scorer)
pipe = pipeline("tick", provider="openai", generator_model="gpt-4o-mini")
print(f"Generator provider: {pipe.generator._provider}")
print(f"Scorer provider: {pipe.scorer._provider}")

Generator provider: openai
Scorer provider: openai
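
One way to avoid hardcoding the provider is to derive the keyword arguments from whatever credentials are present. The sketch below is a hypothetical helper, not part of autochecklist. It uses only the `provider` and `base_url` keywords shown above. `VLLM_BASE_URL` is an assumed variable name, and the priority order is an arbitrary design choice.

```python
from typing import Dict, Mapping


def provider_kwargs(env: Mapping[str, str]) -> Dict[str, str]:
    """Choose pipeline keyword arguments from the available credentials."""
    if env.get("VLLM_BASE_URL"):  # hypothetical variable name
        return {"provider": "vllm", "base_url": env["VLLM_BASE_URL"]}
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai"}
    if env.get("OPENROUTER_API_KEY"):
        return {}  # OpenRouter is the default provider, so no override
    raise RuntimeError("no provider credentials found")


# Intended use (not executed here, since it needs the library and a key):
#   pipe = pipeline("tick", generator_model=MODEL, **provider_kwargs(os.environ))
print(provider_kwargs({"OPENAI_API_KEY": "sk-example"}))
```

Passing a plain mapping instead of reading `os.environ` inside the function keeps the selection logic easy to test.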
