# Corpus-Level Generators Demo

This notebook demonstrates the **corpus-level** checklist generators in `autochecklist`.

Corpus-level generators create **one checklist for an entire dataset or task**, whereas instance-level generators create one checklist per input.

**Methods covered:**
1. **InductiveGenerator** - From feedback comments with refinement pipeline
2. **DeductiveGenerator** (CheckEval) - From dimension definitions with augmentation modes and optional filtering
3. **InteractiveGenerator** (InteractEval) - From think-aloud attributes via 5-stage pipeline

## Setup

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# OpenRouter is the default provider. For other providers, see pipeline_demo.ipynb
# Supported: OpenRouter, OpenAI (direct), vLLM (server or offline)
assert os.getenv("OPENROUTER_API_KEY"), "Please set OPENROUTER_API_KEY in .env file"

In [2]:
# Import corpus-level generators
from autochecklist import (
    # Corpus-level generators
    InductiveGenerator,
    DeductiveGenerator, AugmentationMode,
    InteractiveGenerator,
    # Input models
    DeductiveInput, InteractiveInput,
    # Scorer (can be used with any checklist)
    ChecklistScorer,
)

print("Corpus-level generators:", ["InductiveGenerator", "DeductiveGenerator", "InteractiveGenerator"])

Corpus-level generators: ['InductiveGenerator', 'DeductiveGenerator', 'InteractiveGenerator']


In [3]:
MODEL = "openai/gpt-5-mini"

---
## 1. InductiveGenerator

**InductiveGenerator** creates checklists from feedback comments (e.g., reviewer feedback, user complaints, quality notes).

**Pipeline:**
1. **Generate**: LLM creates questions from feedback batches
2. **Deduplicate**: Merge semantically similar questions (embedding similarity)
3. **Tag**: Filter by applicability and specificity
4. **Select**: Beam search for diverse subset

**Use case**: You have collected feedback about responses and want to create a checklist that addresses the common issues.
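The Deduplicate stage can be pictured as a greedy pass over question embeddings. The sketch below is only an illustration under stated assumptions: `cosine` and `deduplicate` are hypothetical helpers, the 3-dimensional vectors are made-up stand-ins for real embeddings, and the library's actual merging strategy is not shown in this notebook and may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(questions, vectors, threshold=0.85):
    """Greedy dedup: keep a question only if no already-kept question
    has similarity above the threshold (hypothetical helper)."""
    kept = []  # indices of questions retained so far
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]


# Toy data: the vectors are invented for illustration, not real embeddings.
questions = [
    "Are examples provided?",
    "Does the text include examples?",
    "Are technical terms defined?",
]
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],  # orthogonal to the first vector
]

unique_questions = deduplicate(questions, vectors)
print(unique_questions)
```

With these toy vectors the second question is dropped as a near-duplicate of the first, while the third survives. Raising the threshold makes merging more conservative.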

In [4]:
# Sample feedback comments about technical documentation
feedback = [
    "The explanation was too long and hard to follow.",
    "Good use of examples to illustrate concepts.",
    "Missing important details about edge cases.",
    "The code snippets were helpful and clear.",
    "Structure was confusing - hard to find what I needed.",
    "Excellent step-by-step instructions.",
    "Some technical terms weren't explained.",
    "The documentation was outdated in some sections.",
    "Would benefit from more visual diagrams.",
    "The examples don't cover all use cases.",
]

print(f"Collected {len(feedback)} feedback comments:")
for i, f in enumerate(feedback, 1):
    print(f"  {i}. {f}")

Collected 10 feedback comments:
  1. The explanation was too long and hard to follow.
  2. Good use of examples to illustrate concepts.
  3. Missing important details about edge cases.
  4. The code snippets were helpful and clear.
  5. Structure was confusing - hard to find what I needed.
  6. Excellent step-by-step instructions.
  7. Some technical terms weren't explained.
  8. The documentation was outdated in some sections.
  9. Would benefit from more visual diagrams.
  10. The examples don't cover all use cases.


In [5]:
# Generate checklist from feedback
feedback_gen = InductiveGenerator(
    model=MODEL,
    dedup_threshold=0.85,  # Merge questions with >85% similarity
    max_questions=10,      # Limit final checklist size
)

checklist = feedback_gen.generate(
    observations=feedback,
    domain="technical documentation",
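    skip_selection=True,  # Skip the Select stage (beam search) in this demo
    skip_tagging=True,    # Skip the Tag stage (applicability/specificity filtering)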
    verbose=True,
)

print(f"Generated {len(checklist.items)} questions from {len(feedback)} feedback comments:")
print()
for i, item in enumerate(checklist.items, 1):
    print(f"  {i}. {item.question}")

[InductiveGenerator] Generated 7 raw questions from 10 observations


[InductiveGenerator] Deduplication: 7 → 7 questions (0 clusters merged)
[InductiveGenerator] Final checklist: 7 questions
Generated 7 questions from 10 feedback comments:

  1. Is the content concise, well-organized, and easy to follow so readers can quickly find what they need?
  2. Are examples provided that clearly illustrate the concepts?
  3. Do the examples and coverage include common and important use cases and edge cases?
  4. Are any code snippets included clear, correct, and directly usable by readers?
  5. Are technical terms and jargon defined or linked to clear explanations?
  6. Is the documentation up-to-date and accurate for current versions, APIs, or standards?
  7. Does the documentation include helpful diagrams or visuals where they improve understanding?


In [6]:
# Check metadata
print("Checklist metadata:")
print(f"  Source method: {checklist.source_method}")
print(f"  Generation level: {checklist.generation_level}")
print(f"  Observation count: {checklist.metadata.get('observation_count')}")

Checklist metadata:
  Source method: feedback
  Generation level: corpus
  Observation count: 10


In [7]:
# Use the checklist to score documentation
scorer = ChecklistScorer(mode="batch", model=MODEL)

sample_doc = """
# How to Use the API

This guide explains how to make API calls.

## Step 1: Authentication
First, get your API key from the dashboard.

## Step 2: Make a Request
```python
response = client.get('/users')
```

The response contains user data.
"""

score = scorer.score(checklist, target=sample_doc)

print(f"Documentation score: {score.pass_rate:.0%}")
print("\nItem breakdown:")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.answer.name:3}] {item.question[:60]}...")

Documentation score: 0%

Item breakdown:
  [NO ] Is the content concise, well-organized, and easy to follow s...
  [NO ] Are examples provided that clearly illustrate the concepts?...
  [NO ] Do the examples and coverage include common and important us...
  [NO ] Are any code snippets included clear, correct, and directly ...
  [NO ] Are technical terms and jargon defined or linked to clear ex...
  [NO ] Is the documentation up-to-date and accurate for current ver...
  [NO ] Does the documentation include helpful diagrams or visuals w...


---
## 2. DeductiveGenerator (CheckEval)

**DeductiveGenerator** generates checklists from evaluation dimension definitions.

**Four augmentation modes:**
- **seed**: Minimal questions (1-3 per sub-dimension)
- **elaboration**: Detailed questions with more specificity (5+ per sub-dimension)
- **diversification**: Alternative framings for each criterion
- **combined**: Paper-faithful pipeline that generates seeds, runs both elaboration and diversification on them, then merges all three pools

**Use case**: You have defined evaluation criteria (dimensions) and want to generate binary yes/no questions.
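The merge step of the combined mode can be sketched as concatenating three pools while recording where each question came from. This is a minimal illustration, not the library's implementation: `merge_pools` is a hypothetical helper, and dropping exact repeats is an assumption made here for clarity. The `augmentation_source` key mirrors the metadata field read in the COMBINED cell of this notebook.

```python
def merge_pools(seed, elaboration, diversification):
    """Merge three question pools, tagging each question with its source
    and skipping case-insensitive exact repeats (hypothetical helper)."""
    merged = []
    seen = set()
    pools = [
        ("seed", seed),
        ("elaboration", elaboration),
        ("diversification", diversification),
    ]
    for source, pool in pools:
        for question in pool:
            key = question.strip().lower()
            if key in seen:
                continue  # already contributed by an earlier pool
            seen.add(key)
            merged.append({"question": question, "augmentation_source": source})
    return merged


# Toy pools, written by hand for illustration.
seed = ["Does the response flow logically?"]
elaboration = [
    "Does each paragraph follow from the previous one?",
    "Does the response flow logically?",  # repeat of the seed question
]
diversification = ["Can a reader follow the argument without rereading?"]

merged = merge_pools(seed, elaboration, diversification)
for entry in merged:
    print(f"({entry['augmentation_source']}) {entry['question']}")
```

Because seeds are visited first, a question that appears in several pools keeps the `seed` label in this sketch.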

In [8]:
# Define evaluation dimensions
dimensions = [
    DeductiveInput(
        name="coherence",
        definition="The response should maintain logical flow and consistency throughout.",
        sub_dimensions=["Logical Flow", "Consistency", "Organization"],
    ),
    DeductiveInput(
        name="clarity",
        definition="The response should be clear and easy to understand.",
        sub_dimensions=["Language Clarity", "Structure"],
    ),
]

print("Evaluation dimensions:")
for dim in dimensions:
    print(f"\n  {dim.name.upper()}")
    print(f"    Definition: {dim.definition}")
    print(f"    Sub-dimensions: {dim.sub_dimensions}")

Evaluation dimensions:

  COHERENCE
    Definition: The response should maintain logical flow and consistency throughout.
    Sub-dimensions: ['Logical Flow', 'Consistency', 'Organization']

  CLARITY
    Definition: The response should be clear and easy to understand.
    Sub-dimensions: ['Language Clarity', 'Structure']


In [9]:
# Generate with SEED mode (minimal questions)
checkeval_seed = DeductiveGenerator(
    model=MODEL,
    augmentation_mode=AugmentationMode.SEED,
)

seed_checklist = checkeval_seed.generate(dimensions=dimensions)

print(f"SEED mode: {len(seed_checklist.items)} questions")
print()
for item in seed_checklist.items:
    print(f"  [{item.category}] {item.question}")

SEED mode: 10 questions

  [coherence] Does the response present ideas in a clear, progressive sequence?
  [coherence] Are transitions between points smooth and easy to follow?
  [coherence] Is the information presented throughout the response free of contradictions?
  [coherence] Does the response use consistent terminology and facts across all sections?
  [coherence] Is the response structured with a clear introduction, body, and conclusion?
  [coherence] Can the reader quickly identify the main point or purpose of the response?
  [clarity] Are the sentences grammatically correct?
  [clarity] Is the vocabulary and terminology appropriate and understandable for the intended audience?
  [clarity] Is the information organized in a logical sequence?
  [clarity] Does each paragraph or section focus on a single main idea?


In [10]:
# Generate with ELABORATION mode (more detailed questions)
checkeval_elab = DeductiveGenerator(
    model=MODEL,
    augmentation_mode=AugmentationMode.ELABORATION,
)

elab_checklist = checkeval_elab.generate(dimensions=dimensions)

print(f"ELABORATION mode: {len(elab_checklist.items)} questions")
print()
print("Sample questions:")
for item in elab_checklist.items[:6]:
    print(f"  [{item.category}] {item.question}")
if len(elab_checklist.items) > 6:
    print(f"  ... and {len(elab_checklist.items) - 6} more")

ELABORATION mode: 25 questions

Sample questions:
  [coherence] Does the response present ideas in a clear, sequential order?
  [coherence] Is each sentence or paragraph logically connected to the one before it?
  [coherence] Are transitions used to link ideas between sentences or paragraphs?
  [coherence] Can the reader trace a clear progression from the introduction to the conclusion?
  [coherence] Has the response avoided introducing unrelated points that disrupt the flow?
  [coherence] Does the response use the same terms consistently for the same concepts?
  ... and 19 more


In [11]:
# Generate with DIVERSIFICATION mode (alternative framings)
checkeval_div = DeductiveGenerator(
    model=MODEL,
    augmentation_mode=AugmentationMode.DIVERSIFICATION,
)

div_checklist = checkeval_div.generate(dimensions=dimensions)

print(f"DIVERSIFICATION mode: {len(div_checklist.items)} questions")
print()
print("Sample questions:")
for item in div_checklist.items[:6]:
    print(f"  [{item.category}] {item.question}")

DIVERSIFICATION mode: 20 questions

Sample questions:
  [coherence] Does the response present ideas in a clear, progressive sequence?
  [coherence] Are transitions between sentences and paragraphs smooth and connecting ideas?
  [coherence] Can the reader follow the argument without encountering abrupt, unexplained jumps?
  [coherence] Does each sentence logically follow from the previous one?
  [coherence] Is the information presented internally consistent without contradictions?
  [coherence] Does the response use the same factual claims throughout without changing them?


In [12]:
# Compare modes
print("Comparison of augmentation modes:")
print(f"  SEED:            {len(seed_checklist.items)} questions")
print(f"  ELABORATION:     {len(elab_checklist.items)} questions")
print(f"  DIVERSIFICATION: {len(div_checklist.items)} questions")
print("  COMBINED:        (see next cell)")

Comparison of augmentation modes:
  SEED:            10 questions
  ELABORATION:     25 questions
  DIVERSIFICATION: 20 questions
  COMBINED:        (see next cell)


In [13]:
# Generate with COMBINED mode (paper-faithful: seed → elaborate + diversify → merge)
checkeval_combined = DeductiveGenerator(
    model=MODEL,
    augmentation_mode=AugmentationMode.COMBINED,
)

combined_checklist = checkeval_combined.generate(dimensions=dimensions, verbose=True)

print(f"\nCOMBINED mode: {len(combined_checklist.items)} questions")
print()
print("Sample questions (with augmentation source):")
for item in combined_checklist.items[:8]:
    source = item.metadata.get("augmentation_source", "seed") if item.metadata else "seed"
    print(f"  [{item.category}] ({source}) {item.question}")
if len(combined_checklist.items) > 8:
    print(f"  ... and {len(combined_checklist.items) - 8} more")

[DeductiveGenerator] Generated 61 questions from 2 dimensions
[DeductiveGenerator] Final checklist: 61 questions

COMBINED mode: 61 questions

Sample questions (with augmentation source):
  [coherence] (seed) Does the response present points in an order that builds logically on previous points?
  [coherence] (seed) Is each paragraph or sentence connected to the previous one without abrupt topic jumps?
  [coherence] (seed) Does the response use key terms and facts consistently without contradicting earlier statements?
  [coherence] (seed) Is the author's stance or main conclusion maintained consistently throughout the response?
  [coherence] (seed) Has the response been organized into clear sections or distinct paragraphs that separate ideas?
  [coherence] (seed) Can the main idea or purpose of the response be identified quickly from its structure?
  [coherence] (elaboration) Does the response use explicit transitional phrases to signal shifts or links between points?
  [coherence] (ela

### DeductiveGenerator with Filtering Pipeline

DeductiveGenerator can optionally apply the CheckEval paper's **3-stage filtering pipeline**:
1. **Alignment**: Ensures "yes" indicates higher quality
2. **Dimension Consistency**: Questions match dimension definitions
3. **Redundancy Removal**: Eliminates overlapping questions

Use `apply_filtering=True` to enable, and `verbose=True` to see progress.
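The three stages compose as a simple sequence of filters, each receiving the survivors of the previous one. The sketch below shows only that control flow. All names (`run_filters`, the three stage functions, `COHERENCE_KEYWORDS`) are hypothetical, and the keyword and string heuristics are crude placeholders for judgments that the real pipeline presumably delegates to an LLM.

```python
def run_filters(questions, stages):
    """Apply each (name, keep_fn) stage in order and record
    how many questions enter and leave it."""
    report = []
    for name, keep_fn in stages:
        before = len(questions)
        questions = keep_fn(questions)
        report.append((name, before, len(questions)))
    return questions, report


# Placeholder keyword list for the "coherence" dimension (assumption).
COHERENCE_KEYWORDS = ("flow", "connect", "consistent", "organized")


def alignment(questions):
    # Placeholder: drop questions where "yes" would signal lower quality.
    return [q for q in questions if "fail to" not in q.lower()]


def dimension_consistency(questions):
    # Placeholder: keep questions that mention a dimension keyword.
    return [q for q in questions if any(k in q.lower() for k in COHERENCE_KEYWORDS)]


def redundancy_removal(questions):
    # Placeholder: drop case-insensitive exact duplicates, keeping first occurrences.
    seen, unique = set(), []
    for q in questions:
        key = q.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


candidates = [
    "Does the response flow logically?",
    "Does the response fail to connect ideas?",  # "yes" means worse
    "Is the response polite?",                   # off-dimension
    "does the response flow logically?",         # duplicate
]
stages = [
    ("alignment", alignment),
    ("dimension_consistency", dimension_consistency),
    ("redundancy_removal", redundancy_removal),
]

filtered, report = run_filters(candidates, stages)
for name, before, after in report:
    print(f"{name}: {before} -> {after}")
```

Recording the before and after counts per stage makes it easy to see which stage removes the most questions.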

In [14]:
# Generate with verbose output. Filtering is left disabled in this run,
# so the checklist below is unfiltered; uncomment apply_filtering=True to enable it.
checkeval_filtered = DeductiveGenerator(
    model=MODEL,
    augmentation_mode=AugmentationMode.ELABORATION,
)

# Use dimensions from earlier
filtered_checklist = checkeval_filtered.generate(
    dimensions=dimensions,
    # apply_filtering=True,  # Uncomment to enable the 3-stage filtering pipeline
    verbose=True,            # Print progress at each stage
)

print(f"\nFinal filtered checklist: {len(filtered_checklist.items)} questions")
print("\nSample questions after filtering:")
for item in filtered_checklist.items[:5]:
    print(f"  [{item.category}] {item.question}")

[DeductiveGenerator] Generated 25 questions from 2 dimensions
[DeductiveGenerator] Final checklist: 25 questions

Final filtered checklist: 25 questions

Sample questions after filtering:
  [coherence] Does the response progress in a clear, logical order from one idea to the next?
  [coherence] Is each sentence or paragraph linked to the previous one without abrupt jumps?
  [coherence] Does the response present premises that directly support its main conclusion?
  [coherence] Can a reader follow the sequence of ideas without needing to re-read earlier parts?
  [coherence] Has the response avoided introducing unrelated information that breaks the flow?


In [15]:
# Use pre-defined seed questions (skip LLM generation)
seed_questions = {
    "coherence": {
        "Logical Flow": ["Does the response flow logically?"],
        "Consistency": ["Is the response internally consistent?"],
        "Organization": ["Is the response well-organized?"],
    },
    "clarity": {
        "Language Clarity": ["Is the language clear and unambiguous?"],
        "Structure": ["Is the structure easy to follow?"],
    },
}

checkeval = DeductiveGenerator(model=MODEL)
custom_checklist = checkeval.generate(
    dimensions=dimensions,
    seed_questions=seed_questions,
    augment=False,  # Use seed questions directly, no LLM augmentation
)

print(f"Custom seed questions (no LLM): {len(custom_checklist.items)} items")
for item in custom_checklist.items:
    print(f"  - {item.question}")

Custom seed questions (no LLM): 5 items
  - Does the response flow logically?
  - Is the response internally consistent?
  - Is the response well-organized?
  - Is the language clear and unambiguous?
  - Is the structure easy to follow?


---
## 3. InteractiveGenerator (InteractEval)

**InteractiveGenerator** generates checklists from think-aloud attributes through a **5-stage pipeline**:

1. **Component Extraction**: Find recurring themes from attributes (max 5)
2. **Attributes Clustering**: Group attributes under components
3. **Key Question Generation**: 1 yes/no question per component
4. **Sub-Question Generation**: 2-3 sub-questions per key question
5. **Question Validation**: Refine and validate final checklist

**Think-aloud attributes** come from human evaluators or LLMs reading samples with rubrics and noting what they look for.

**Use case**: You have evaluation considerations from human/LLM think-aloud sessions and want to structure them into a checklist.
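The output of stages 1 through 4 can be thought of as a small tree: components at the top, one key question per component, and a few sub-questions under each key question. The sketch below models that shape and the resulting checklist size. The `Component` type and both helpers are hypothetical and only illustrate the structure described above; stage 5 (validation) may still change the final count.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    """One recurring theme with its key question and sub-questions (hypothetical type)."""
    name: str
    key_question: str
    sub_questions: List[str] = field(default_factory=list)


def flatten(components: List[Component]) -> List[str]:
    """Emit each key question followed by its sub-questions, in component order."""
    items: List[str] = []
    for component in components:
        items.append(component.key_question)
        items.extend(component.sub_questions)
    return items


def size_bounds(n_components: int, min_sub: int = 2, max_sub: int = 3) -> Tuple[int, int]:
    """Size range before validation: 1 key question plus 2-3 sub-questions per component."""
    return n_components * (1 + min_sub), n_components * (1 + max_sub)


# Hand-written example components, for illustration only.
components = [
    Component(
        name="Logical flow",
        key_question="Do ideas progress logically?",
        sub_questions=[
            "Does each sentence build on the previous one?",
            "Are transitions used where needed?",
        ],
    ),
    Component(
        name="Structure",
        key_question="Is the summary well organized?",
        sub_questions=[
            "Is there a clear opening sentence?",
            "Does each paragraph cover one main point?",
            "Does the summary end with a conclusion?",
        ],
    ),
]

flat = flatten(components)
bounds = size_bounds(4)
print(len(flat), bounds)
```

Under these assumptions, four components yield between 12 and 16 questions before validation.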

In [16]:
# Think-aloud attributes from evaluation sessions
# These are considerations that evaluators noted when assessing summaries
attributes = InteractiveInput(
    source="human_llm",  # Combined from human and LLM sources
    dimension="coherence",
    attributes=[
        "Check if the summary maintains logical flow between sentences",
        "Ensure ideas connect smoothly without abrupt transitions",
        "Assess if the structure is well-organized with clear sections",
        "Verify the summary avoids jumping between unrelated topics",
        "Look for consistent point of view throughout the text",
        "Check that main points are clearly and concisely presented",
        "Ensure the summary builds to a coherent conclusion",
        "Verify there is a clear beginning, middle, and end",
        "Check for effective use of transition words",
        "Assess if the summary maintains a consistent narrative voice",
    ],
    sample_context="Summary evaluation for news articles",
)

print(f"Think-aloud attributes ({len(attributes.attributes)} items):")
print(f"  Source: {attributes.source}")
print(f"  Dimension: {attributes.dimension}")
print()
for attr in attributes.attributes:
    print(f"  - {attr}")

Think-aloud attributes (10 items):
  Source: human_llm
  Dimension: coherence

  - Check if the summary maintains logical flow between sentences
  - Ensure ideas connect smoothly without abrupt transitions
  - Assess if the structure is well-organized with clear sections
  - Verify the summary avoids jumping between unrelated topics
  - Look for consistent point of view throughout the text
  - Check that main points are clearly and concisely presented
  - Ensure the summary builds to a coherent conclusion
  - Verify there is a clear beginning, middle, and end
  - Check for effective use of transition words
  - Assess if the summary maintains a consistent narrative voice


In [17]:
# Define the rubric for the dimension
rubric = """
Coherence measures the collective quality of all sentences. 
The summary should be well-structured and well-organized. 
The summary should not just be a heap of related information, 
but should build from sentence to sentence to a coherent body of information about a topic.
"""

print("Rubric:")
print(rubric)

Rubric:

Coherence measures the collective quality of all sentences. 
The summary should be well-structured and well-organized. 
The summary should not just be a heap of related information, 
but should build from sentence to sentence to a coherent body of information about a topic.



In [18]:
# Generate checklist via 5-stage pipeline
interacteval = InteractiveGenerator(
    model=MODEL,
    max_components=4,  # Limit to 4 main themes
)

checklist = interacteval.generate(
    inputs=[attributes],
    rubric=rubric,
)

print(f"Generated {len(checklist.items)} questions from {len(attributes.attributes)} attributes:")
print()
for i, item in enumerate(checklist.items, 1):
    print(f"  {i}. {item.question}")

Generated 13 questions from 10 attributes:

  1. Is there a clear introductory sentence that establishes the topic and purpose?
  2. Does the text maintain a single main topic from start to finish?
  3. Are paragraphs organized so each focuses on a single main point?
  4. Do paragraphs appear in a logical order (for example: context → key points → conclusion)?
  5. Do sentences within each paragraph build logically on the previous sentence?
  6. Are clear topic sentences or section cues used to signal each paragraph's main point?
  7. Are transition words or phrases used to link sentences and paragraphs where needed?
  8. Do pronouns and other referential terms clearly and consistently refer back to earlier concepts?
  9. Are key terms or repeated words used consistently to reinforce connections between ideas?
  10. Are individual sentences clearly relevant to advancing or supporting the main topic (no off-topic content)?
  11. Does the final sentence provide a concise conclusion or cl

In [19]:
# Check metadata
print("Checklist metadata:")
print(f"  Source method: {checklist.source_method}")
print(f"  Generation level: {checklist.generation_level}")
print(f"  Dimension: {checklist.metadata.get('dimension')}")
print(f"  Attribute count: {checklist.metadata.get('attribute_count')}")
print(f"  Source: {checklist.metadata.get('source')}")

Checklist metadata:
  Source method: interacteval
  Generation level: corpus
  Dimension: coherence
  Attribute count: 10
  Source: human_llm


In [20]:
# Combine multiple sources (human + LLM)
# In practice, you might have separate think-aloud sessions from humans and LLMs

human_attrs = InteractiveInput(
    source="human",
    dimension="coherence",
    attributes=[
        "Does it make sense as a whole?",
        "Can I follow the argument easily?",
        "Are the ideas connected?",
    ],
)

llm_attrs = InteractiveInput(
    source="llm",
    dimension="coherence",
    attributes=[
        "Check for logical consistency between statements",
        "Verify smooth transitions between paragraphs",
        "Assess narrative structure and flow",
    ],
)

combined_checklist = interacteval.generate(
    inputs=[human_attrs, llm_attrs],  # Combine both sources
    rubric=rubric,
)

print("Combined checklist from human + LLM sources:")
print(f"  Total attributes: {combined_checklist.metadata.get('attribute_count')}")
print(f"  Source: {combined_checklist.metadata.get('source')}")
print(f"  Questions generated: {len(combined_checklist.items)}")
print()
for i, item in enumerate(combined_checklist.items, 1):
    print(f"  {i}. {item.question}")

Combined checklist from human + LLM sources:
  Total attributes: 6
  Source: human_llm
  Questions generated: 11

  1. Does the text focus on a single clear main idea or thesis?
  2. Is there a clear opening or topic sentence that frames the whole summary?
  3. Are all sentences directly relevant to and supportive of that main idea?
  4. Do sentences and paragraphs build logically so each one advances the prior content?
  5. Is terminology and concept usage consistent across sentences?
  6. Are factual statements and claims internally consistent with one another?
  7. Are causal and comparative relationships presented clearly and without contradiction?
  8. Does the overall sequence follow a clear organizational pattern (for example: background → details → conclusion)?
  9. Are explicit transitions or referencing used to signal relationships and maintain continuity between sentences and paragraphs?
  10. Are paragraph breaks placed where a new subtopic or logical step begins and do the

In [21]:
# Use the checklist to score a summary
scorer = ChecklistScorer(mode="batch", model=MODEL)

good_summary = """
The city council approved a new budget yesterday, allocating increased funds 
for public transportation and education. The decision follows months of public 
consultation and reflects community priorities. Mayor Johnson praised the 
collaborative process, noting that the budget addresses both immediate needs 
and long-term infrastructure goals. Implementation begins next fiscal year.
"""

score = scorer.score(checklist, target=good_summary)

print("Summary:")
print(good_summary)
print(f"\nCoherence score: {score.pass_rate:.0%}")

Summary:

The city council approved a new budget yesterday, allocating increased funds 
for public transportation and education. The decision follows months of public 
consultation and reflects community priorities. Mayor Johnson praised the 
collaborative process, noting that the budget addresses both immediate needs 
and long-term infrastructure goals. Implementation begins next fiscal year.


Coherence score: 100%


---
## Summary

| Generator | Input | Pipeline | Use Case |
|-----------|-------|----------|----------|
| **InductiveGenerator** | Feedback comments | Generate → Deduplicate → Tag → Select | From user/reviewer feedback |
| **DeductiveGenerator** | Dimension definitions | LLM generation with augmentation modes | From evaluation criteria |
| **InteractiveGenerator** | Think-aloud attributes | 5-stage refinement pipeline | From evaluation sessions |

### Key Differences

- **InductiveGenerator**: Best when you have unstructured feedback and want to synthesize it into evaluation criteria
- **DeductiveGenerator**: Best when you have well-defined dimensions and want to expand them into detailed questions
- **InteractiveGenerator**: Best when you have evaluation considerations from think-aloud sessions and want to organize them

### All checklists can be used with any scorer config

Once generated, corpus-level checklists work with any `ChecklistScorer` configuration (`mode="batch"`, `mode="item"`, weighted, normalized) just like instance-level checklists.
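For intuition about the scores printed earlier, a pass rate can be read as the share of checklist items answered YES, optionally weighted. The helper below is a sketch under that assumption. `pass_rate` here is a hypothetical standalone function, and how `ChecklistScorer` actually computes `score.pass_rate` or applies weights is not shown in this notebook.

```python
def pass_rate(answers, weights=None):
    """Share of (optionally weighted) items answered "YES" (hypothetical helper)."""
    if not answers:
        return 0.0
    if weights is None:
        weights = [1.0] * len(answers)  # unweighted: every item counts equally
    if len(weights) != len(answers):
        raise ValueError("answers and weights must have the same length")
    total = sum(weights)
    passed = sum(w for a, w in zip(answers, weights) if a == "YES")
    return passed / total if total else 0.0


# Toy answers, for illustration only.
answers = ["YES", "YES", "NO", "YES"]

unweighted = pass_rate(answers)
weighted = pass_rate(answers, weights=[1, 1, 2, 1])  # the failed item counts double

print(f"Unweighted: {unweighted:.0%}")
print(f"Weighted:   {weighted:.0%}")
```

In this sketch, giving the failed item double weight lowers the score from 75% to 60%.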