# autochecklist Demo

This notebook demonstrates the checklist generation and scoring methods implemented in `autochecklist`.

**Methods covered:**
1. **TICK pipeline** (DirectGenerator) + ChecklistScorer (batch mode): a simple checklist generated from the input alone
2. **RLCF pipelines** (DirectGenerator/ContrastiveGenerator) + ChecklistScorer (item mode, weighted metric): items carry importance weights
3. **RocketEval pipeline** (DirectGenerator) + ChecklistScorer (item mode with logprobs): confidence-aware scoring

## Setup

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# OpenRouter is the default provider. For other providers, see pipeline_demo.ipynb
# Supported: OpenRouter, OpenAI (direct), vLLM (server or offline)
assert os.getenv("OPENROUTER_API_KEY"), "Please set OPENROUTER_API_KEY in .env file"

In [2]:
# Import the package
from autochecklist import (
    # Generators
    DirectGenerator, ContrastiveGenerator,
    # Scorer
    ChecklistScorer,
    # Models
    Checklist, Score, ConfidenceLevel,
    # Pipeline (recommended way)
    pipeline,
)

print("Instance-level generators: DirectGenerator (tick, rocketeval, rlcf_direct), ContrastiveGenerator (rlcf_candidate)")
print("ChecklistScorer modes: batch, item (with capture_reasoning, use_logprobs, primary_metric options)")

Instance-level generators: DirectGenerator (tick, rocketeval, rlcf_direct), ContrastiveGenerator (rlcf_candidate)
ChecklistScorer modes: batch, item (with capture_reasoning, use_logprobs, primary_metric options)


In [3]:
MODEL = "openai/gpt-5-mini"

---
## 1. TICK pipeline + ChecklistScorer (batch mode)

**TICK pipeline** generates a checklist from just an input using few-shot prompting via `DirectGenerator`.

**ChecklistScorer (batch mode)** evaluates all items in a single LLM call and returns the pass rate.
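
As a rough illustration of what "pass rate" means here, the sketch below computes the fraction of checklist items answered yes. It is a plain-Python stand-in, not `autochecklist` code: the `pass_rate` helper and the `"yes"`/`"no"` strings are assumptions made for this example.

```python
from typing import Sequence


def pass_rate(answers: Sequence[str]) -> float:
    """Fraction of checklist items answered 'yes' (0.0 for an empty checklist)."""
    if not answers:
        return 0.0
    yes_count = sum(1 for a in answers if a.strip().lower() == "yes")
    return yes_count / len(answers)


# Four items, three of them satisfied
print(f"{pass_rate(['yes', 'yes', 'no', 'yes']):.0%}")  # prints 75%
```

Every item counts equally in this metric, which is why it suits unweighted checklists such as the TICK one.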

In [4]:
# Initialize the TICK generator (DirectGenerator with the "tick" preset)
tick = DirectGenerator(method_name="tick", model=MODEL)

# Generate checklist from input
input_text = "Write a haiku about autumn leaves falling."
checklist = tick.generate(input=input_text)

print(f"Input: {input_text}")
print(f"\nGenerated {len(checklist.items)} checklist items:")
print(checklist.to_text())

Input: Write a haiku about autumn leaves falling.

Generated 4 checklist items:
1. Is the response written as a haiku (presented in three lines)?
2. Does the haiku follow the traditional 5-7-5 syllable pattern?
3. Is the subject of the haiku clearly about autumn leaves falling?
4. Does the response contain only the haiku (no additional unrelated explanation or text)?


In [5]:
# Score a GOOD response
scorer = ChecklistScorer(mode="batch", model=MODEL)

good_response = """Crimson leaves drift down
Dancing on the autumn breeze
Nature's last ballet"""

score = scorer.score(checklist, target=good_response, input=input_text)

print("GOOD Response:")
print(good_response)
print(f"\nPass rate: {score.pass_rate:.0%}")
print("\nItem scores:")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.answer}] {item.question}")

GOOD Response:
Crimson leaves drift down
Dancing on the autumn breeze
Nature's last ballet

Pass rate: 100%

Item scores:
  [ChecklistItemAnswer.YES] Is the response written as a haiku (presented in three lines)?
  [ChecklistItemAnswer.YES] Does the haiku follow the traditional 5-7-5 syllable pattern?
  [ChecklistItemAnswer.YES] Is the subject of the haiku clearly about autumn leaves falling?
  [ChecklistItemAnswer.YES] Does the response contain only the haiku (no additional unrelated explanation or text)?


In [6]:
# Score a BAD response
bad_response = "Autumn is nice. I like the colors. The weather is cool."

score = scorer.score(checklist, target=bad_response, input=input_text)

print("BAD Response:")
print(bad_response)
print(f"\nPass rate: {score.pass_rate:.0%}")
print("\nItem scores:")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.answer}] {item.question}")

BAD Response:
Autumn is nice. I like the colors. The weather is cool.

Pass rate: 0%

Item scores:
  [ChecklistItemAnswer.NO] Is the response written as a haiku (presented in three lines)?
  [ChecklistItemAnswer.NO] Does the haiku follow the traditional 5-7-5 syllable pattern?
  [ChecklistItemAnswer.NO] Is the subject of the haiku clearly about autumn leaves falling?
  [ChecklistItemAnswer.NO] Does the response contain only the haiku (no additional unrelated explanation or text)?


### ChecklistScorer (item mode with reasoning)

**ChecklistScorer with `mode="item"` and `capture_reasoning=True`** evaluates each checklist item individually (one LLM call per question).

This mirrors the original TICK paper's evaluation methodology. Useful when:
- You want more accurate per-item judgments
- You want individual analysis reasoning for each question
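
The difference between the two modes comes down to how many judge calls are made. The sketch below counts calls using stand-in judge functions; `score_batch`, `score_items`, and the fake judges are hypothetical names for this illustration and do not reflect the internals of `ChecklistScorer`.

```python
from typing import Callable, Dict, List


def score_batch(questions: List[str],
                judge_all: Callable[[List[str]], List[bool]]) -> List[bool]:
    """Batch mode: a single judge call covers every question."""
    return judge_all(questions)


def score_items(questions: List[str],
                judge_one: Callable[[str], bool]) -> List[bool]:
    """Item mode: one judge call per question."""
    return [judge_one(q) for q in questions]


calls: Dict[str, int] = {"batch": 0, "item": 0}


def fake_judge_all(qs: List[str]) -> List[bool]:
    # Stand-in for one LLM request that answers all questions at once
    calls["batch"] += 1
    return [True] * len(qs)


def fake_judge_one(q: str) -> bool:
    # Stand-in for one LLM request that answers a single question
    calls["item"] += 1
    return True


questions = [
    "Is it three lines?",
    "Is it 5-7-5?",
    "Is it about autumn leaves?",
    "Is it only the haiku?",
]
score_batch(questions, fake_judge_all)
score_items(questions, fake_judge_one)
print(calls)  # {'batch': 1, 'item': 4}
```

Item mode therefore costs one call per checklist item, which is the trade-off for per-question reasoning.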

In [7]:
# ChecklistScorer (item mode): One LLM call per question (original TICK methodology)
item_scorer = ChecklistScorer(mode="item", capture_reasoning=True, model=MODEL, temperature=0.0)

# Use the same checklist from earlier
score = item_scorer.score(checklist, target=good_response, input=input_text)

print("Item-mode scorer results:")
print(f"  Pass rate: {score.pass_rate:.0%}")
print(f"  LLM calls made: {score.metadata['num_calls']}")
print()
print("Item-by-item judgments with analysis:")
for item, item_score in zip(checklist.items, score.item_scores):
    analysis = item_score.reasoning[:80] + "..." if item_score.reasoning and len(item_score.reasoning) > 80 else item_score.reasoning
    print(f"\n  [{item_score.answer.name:3}] {item.question}")
    if analysis:
        print(f"       → {analysis}")

Item-mode scorer results:
  Pass rate: 100%
  LLM calls made: 4

Item-by-item judgments with analysis:

  [YES] Is the response written as a haiku (presented in three lines)?
       → The generated text is presented in three separate lines: "Crimson leaves drift d...

  [YES] Does the haiku follow the traditional 5-7-5 syllable pattern?
       → Line 1: "Crimson leaves drift down" = crimson (2) + leaves (1) + drift (1) + dow...

  [YES] Is the subject of the haiku clearly about autumn leaves falling?
       → The haiku explicitly names 'Crimson leaves' and describes them 'drift[ing] down,...

  [YES] Does the response contain only the haiku (no additional unrelated explanation or text)?
       → The generated text consists solely of a three-line haiku about autumn leaves fal...


---
## 2. RLCF pipelines + ChecklistScorer (weighted mode)

**RLCF** generates checklists with weighted items (0-100 importance) via `DirectGenerator` (direct mode) or `ContrastiveGenerator` (candidate modes).

Two main modes:
- **rlcf_direct** (DirectGenerator): Uses input + reference response
- **rlcf_candidate** (ContrastiveGenerator): Also analyzes candidate responses to find failure modes
  - Pass explicit candidates, OR
  - Auto-generate candidates using smaller models (recommended, matches original paper)

**ChecklistScorer with `primary_metric="weighted"`** uses item weights for weighted average scoring.
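
A minimal sketch of weighted averaging, assuming the weighted score is the sum of weights for items answered yes divided by the total weight. The `weighted_score` helper is hypothetical, not the library's implementation. The weights and answers are copied from the buggy-implementation run later in this section, and under this assumption the sketch reproduces the 56% and 57% reported there.

```python
from typing import Sequence


def weighted_score(weights: Sequence[float], passed: Sequence[bool]) -> float:
    """Sum of weights for passed items divided by the sum of all weights."""
    if len(weights) != len(passed):
        raise ValueError("weights and passed must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for w, ok in zip(weights, passed) if ok) / total


# Weights and YES/NO answers from the buggy factorial example
weights = [100, 100, 85, 100, 95, 90, 80]
passed = [True, True, True, False, False, False, True]

print(f"weighted:   {weighted_score(weights, passed):.0%}")  # prints weighted:   56%
print(f"unweighted: {sum(passed) / len(passed):.0%}")        # prints unweighted: 57%
```

The two numbers diverge more when the failed items carry unusually high or low weights.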

In [8]:
# RLCF Direct Mode (via DirectGenerator)
rlcf = DirectGenerator(method_name="rlcf_direct", model=MODEL, max_tokens=4096)

input_text = "Write a Python function that calculates the factorial of a number."
reference = """def factorial(n):
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n <= 1:
        return 1
    return n * factorial(n - 1)"""

checklist = rlcf.generate(input=input_text, reference=reference)

print(f"Input: {input_text}")
print(f"\nGenerated {len(checklist.items)} weighted checklist items:")
for item in checklist.items:
    print(f"  [{item.weight:3.0f}] {item.question}")

Input: Write a Python function that calculates the factorial of a number.

Generated 7 weighted checklist items:
  [100] Is the provided code syntactically valid Python code (would parse/run without syntax errors)?
  [100] Does the response define a Python function (i.e., contains a def statement)?
  [ 85] Does the defined function accept one argument (a parameter for the number whose factorial is to be calculated)?
  [100] For input 0, does the function return 1?
  [ 95] For input 5, does the function return 120?
  [ 90] Does the function return the factorial value (use a return statement) rather than only printing it?
  [ 80] Is the function self-contained (does not reference undefined helper functions or missing imports)?


In [9]:
# Score with ChecklistScorer (weighted mode)
weighted_scorer = ChecklistScorer(mode="item", primary_metric="weighted", model=MODEL)

# Good implementation
good_impl = """def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result"""

score = weighted_scorer.score(checklist, target=good_impl, input=input_text)

print("GOOD Implementation:")
print(good_impl)
print(f"\nWeighted score: {score.weighted_score:.0%}")
print(f"Unweighted score: {score.total_score:.0%}")

GOOD Implementation:
def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

Weighted score: 100%
Unweighted score: 100%


In [10]:
# Buggy implementation (no base case, so the recursion never terminates)
buggy_impl = """def factorial(n):
    return n * factorial(n - 1)"""

score = weighted_scorer.score(checklist, target=buggy_impl, input=input_text)

print("BUGGY Implementation:")
print(buggy_impl)
print(f"\nWeighted score: {score.weighted_score:.0%}")
print(f"Unweighted score: {score.total_score:.0%}")
print("\nItem scores:")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.answer}] (weight={item.weight:3.0f}) {item.question}")

BUGGY Implementation:
def factorial(n):
    return n * factorial(n - 1)

Weighted score: 56%
Unweighted score: 57%

Item scores:
  [ChecklistItemAnswer.YES] (weight=100) Is the provided code syntactically valid Python code (would parse/run without syntax errors)?
  [ChecklistItemAnswer.YES] (weight=100) Does the response define a Python function (i.e., contains a def statement)?
  [ChecklistItemAnswer.YES] (weight= 85) Does the defined function accept one argument (a parameter for the number whose factorial is to be calculated)?
  [ChecklistItemAnswer.NO] (weight=100) For input 0, does the function return 1?
  [ChecklistItemAnswer.NO] (weight= 95) For input 5, does the function return 120?
  [ChecklistItemAnswer.NO] (weight= 90) Does the function return the factorial value (use a return statement) rather than only printing it?
  [ChecklistItemAnswer.YES] (weight= 80) Is the function self-contained (does not reference undefined helper functions or missing imports)?


In [11]:
# RLCF Candidate-Based Mode (finds failure modes from examples)
rlcf_candidate = ContrastiveGenerator(method_name="rlcf_candidate", model=MODEL)

# Candidate responses with various issues
candidates = [
    "def factorial(n): return n * factorial(n-1)",  # No base case
    "def fact(n): return 1 if n == 0 else n * fact(n-1)",  # Wrong function name
    "def factorial(n): return math.factorial(n)",  # Uses library
]

checklist = rlcf_candidate.generate(
    input=input_text,
    reference=reference,
    candidates=candidates
)

print("Candidate-based mode (analyzes failure patterns):")
print(f"Generated {len(checklist.items)} items:")
for item in checklist.items:
    print(f"  [{item.weight:3.0f}] {item.question}")

Candidate-based mode (analyzes failure patterns):
Generated 5 items:
  [100] Does the response contain a valid Python function definition (a def statement with a function name and parameter list)?
  [100] Does the function return 1 when called with 0 (i.e., factorial(0) == 1)?
  [100] Does the function return the correct factorial for a typical positive integer input (e.g., factorial(5) == 120)?
  [ 90] Is the provided code syntactically valid Python as given (parsable and not referencing undefined names or missing required imports)?
  [ 40] Does the function explicitly handle negative inputs (for example by raising an error or documenting a defined behavior)?


In [12]:
# RLCF with Auto-Generated Candidates (Recommended)
# Instead of providing explicit candidates, let ContrastiveGenerator auto-generate them
# This matches the original RLCF paper methodology

rlcf_auto = ContrastiveGenerator(
    method_name="rlcf_candidate",
    model=MODEL,
    candidate_models=["openai/gpt-4o-mini"],  # Smaller model(s) to generate candidates
    num_candidates=2,  # Number of candidates to generate (when using single model)
)

auto_checklist = rlcf_auto.generate(
    input=input_text,
    reference=reference,
    # No candidates argument - they're auto-generated!
)

print("Auto-candidate mode (generates candidates automatically):")
print(f"Using {auto_checklist.metadata.get('num_candidates')} auto-generated candidates\n")

# Print the auto-generated candidates
print("=" * 60)
print("AUTO-GENERATED CANDIDATES:")
print("=" * 60)
for i, candidate in enumerate(auto_checklist.metadata.get("candidates", []), 1):
    print(f"\n--- Candidate {i} ---")
    print(candidate[:300] + "..." if len(candidate) > 300 else candidate)

print("\n" + "=" * 60)
print(f"GENERATED CHECKLIST ({len(auto_checklist.items)} items):")
print("=" * 60)
for item in auto_checklist.items:
    print(f"  [{item.weight:3.0f}] {item.question}")

Auto-candidate mode (generates candidates automatically):
Using 2 auto-generated candidates

AUTO-GENERATED CANDIDATES:

--- Candidate 1 ---
Certainly! Below is a Python function that calculates the factorial of a given non-negative integer using a recursive approach:

```python
def factorial(n):
    if n < 0:
        raise ValueError("Factorial is not defined for negative numbers.")
    elif n == 0 or n == 1:
        return 1
    else:
...

--- Candidate 2 ---
Certainly! Here is a Python function that calculates the factorial of a given non-negative integer using a recursive approach:

```python
def factorial(n):
    """Calculate the factorial of a non-negative integer n."""
    if n < 0:
        raise ValueError("Factorial is not defined for negative num...

GENERATED CHECKLIST (5 items):
  [100] Does the response include syntactically valid Python code that defines a function?
  [ 90] Does the defined function accept exactly one parameter?
  [ 95] Does the function return 1 for inpu

In [13]:
# Score using the auto-generated candidate checklist
print("Scoring with AUTO-CANDIDATE checklist:")
print("=" * 60)

# Score the good implementation
score = weighted_scorer.score(auto_checklist, target=good_impl, input=input_text)
print("\nGOOD Implementation:")
print(good_impl)
print(f"\nWeighted score: {score.weighted_score:.0%}")

# Score the buggy implementation  
score = weighted_scorer.score(auto_checklist, target=buggy_impl, input=input_text)
print("\n" + "-" * 60)
print("\nBUGGY Implementation:")
print(buggy_impl)
print(f"\nWeighted score: {score.weighted_score:.0%}")
print("\nItem breakdown:")
for item, item_score in zip(auto_checklist.items, score.item_scores):
    print(f"  [{item_score.answer}] (weight={item.weight:3.0f}) {item.question}")

Scoring with AUTO-CANDIDATE checklist:



GOOD Implementation:
def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

Weighted score: 100%



------------------------------------------------------------

BUGGY Implementation:
def factorial(n):
    return n * factorial(n - 1)

Weighted score: 44%

Item breakdown:
  [ChecklistItemAnswer.YES] (weight=100) Does the response include syntactically valid Python code that defines a function?
  [ChecklistItemAnswer.YES] (weight= 90) Does the defined function accept exactly one parameter?
  [ChecklistItemAnswer.NO] (weight= 95) Does the function return 1 for input 0 and for input 1?
  [ChecklistItemAnswer.NO] (weight=100) Does the function compute the factorial correctly for a positive integer input (for example, return 120 for input 5)?
  [ChecklistItemAnswer.NO] (weight= 50) Does the function handle negative inputs by raising an exception or otherwise signaling an error?


### RLCF Candidates-Only Mode (No Reference)

When you don't have a reference/gold-standard response, `ContrastiveGenerator` can still generate checklists
by analyzing auto-generated candidates. This is the `rlcf_candidates_only` preset.

**Use case:** Evaluating responses to open-ended questions where no "correct" answer exists.

In [14]:
# RLCF Candidates-Only Mode (No Reference Required)
# Useful when you don't have a gold-standard reference response

rlcf_no_ref = ContrastiveGenerator(
    method_name="rlcf_candidates_only",
    model=MODEL,
    candidate_models=["openai/gpt-4o-mini"],
    num_candidates=3,
)

# Open-ended input with no single "correct" answer
open_input = "Explain why the sky is blue in a way a 5-year-old would understand."

# Generate checklist WITHOUT providing a reference
# Candidates are auto-generated and compared to each other to identify criteria
no_ref_checklist = rlcf_no_ref.generate(input=open_input)
# Note: No reference argument! 

print("CANDIDATES-ONLY MODE (no reference)")
print("=" * 60)
print(f"Input: {open_input}")
print(f"\nMetadata shows candidates_only={no_ref_checklist.metadata.get('candidates_only')}")
print(f"Number of auto-generated candidates: {no_ref_checklist.metadata.get('num_candidates')}")
print(f"\nGenerated {len(no_ref_checklist.items)} weighted checklist items:")
for item in no_ref_checklist.items:
    print(f"  [{item.weight:3.0f}] {item.question}")

CANDIDATES-ONLY MODE (no reference)
Input: Explain why the sky is blue in a way a 5-year-old would understand.

Metadata shows candidates_only=None
Number of auto-generated candidates: 3

Generated 7 weighted checklist items:
  [100] Does the response give a clear explanation of why the sky appears blue (i.e., links sunlight + atmosphere to the blue appearance)?
  [100] Is the explanation factually accurate with no major scientific errors about the cause of the sky's color?
  [ 90] Does the response state that sunlight contains multiple colors (or that white light is made of many colors)?
  [ 95] Does the response state that blue light is scattered more by the air/atmosphere than other colors (or convey this idea in age-appropriate terms)?
  [ 90] Does the response use vocabulary and short sentence structure appropriate for a typical 5-year-old (simple words, no complex sentences)?
  [ 80] Does the response avoid unexplained technical jargon (or explain any necessary term in a kid-frie

In [15]:
# Score responses using the candidates-only checklist

print("Scoring with CANDIDATES-ONLY checklist:")
print("=" * 60)

# A child-friendly explanation
response1 = """The sky is blue because tiny bits of air scatter blue light from the sun 
more than other colors, like how glitter spreads everywhere when you throw it!"""

# A too-technical explanation (not appropriate for a 5-year-old)
response2 = """Due to Rayleigh scattering, shorter wavelength electromagnetic radiation 
is scattered more efficiently by atmospheric molecules, resulting in the preferential 
scattering of blue light (λ ≈ 450-495 nm) compared to longer wavelengths."""

for i, resp in enumerate([response1, response2], 1):
    score = weighted_scorer.score(no_ref_checklist, target=resp, input=open_input)
    print(f"\nResponse {i}: {resp[:70].strip()}...")
    print(f"Weighted score: {score.weighted_score:.0%}")
    print("Item breakdown:")
    for item, item_score in zip(no_ref_checklist.items, score.item_scores):
        print(f"  [{item_score.answer}] (weight={item.weight:3.0f}) {item.question[:55]}...")

Scoring with CANDIDATES-ONLY checklist:



Response 1: The sky is blue because tiny bits of air scatter blue light from the s...
Weighted score: 85%
Item breakdown:
  [ChecklistItemAnswer.YES] (weight=100) Does the response give a clear explanation of why the s...
  [ChecklistItemAnswer.YES] (weight=100) Is the explanation factually accurate with no major sci...
  [ChecklistItemAnswer.YES] (weight= 90) Does the response state that sunlight contains multiple...
  [ChecklistItemAnswer.YES] (weight= 95) Does the response state that blue light is scattered mo...
  [ChecklistItemAnswer.NO] (weight= 90) Does the response use vocabulary and short sentence str...
  [ChecklistItemAnswer.YES] (weight= 80) Does the response avoid unexplained technical jargon (o...
  [ChecklistItemAnswer.YES] (weight= 60) Does the response include an age-appropriate analogy or...



Response 2: Due to Rayleigh scattering, shorter wavelength electromagnetic radiati...
Weighted score: 32%
Item breakdown:
  [ChecklistItemAnswer.NO] (weight=100) Does the response give a clear explanation of why the s...
  [ChecklistItemAnswer.YES] (weight=100) Is the explanation factually accurate with no major sci...
  [ChecklistItemAnswer.NO] (weight= 90) Does the response state that sunlight contains multiple...
  [ChecklistItemAnswer.YES] (weight= 95) Does the response state that blue light is scattered mo...
  [ChecklistItemAnswer.NO] (weight= 90) Does the response use vocabulary and short sentence str...
  [ChecklistItemAnswer.NO] (weight= 80) Does the response avoid unexplained technical jargon (o...
  [ChecklistItemAnswer.NO] (weight= 60) Does the response include an age-appropriate analogy or...


---
## 3. RocketEval pipeline + ChecklistScorer (logprobs mode)

**RocketEval pipeline** generates checklists from query + reference response via `DirectGenerator`.

**ChecklistScorer with `use_logprobs=True`** uses logprobs to calculate confidence levels:
- YES_90: Very confident yes (≥85%)
- YES_70: Confident yes (65-85%)
- UNSURE: Uncertain (35-65%)
- NO_30: Confident no (15-35%)
- NO_10: Very confident no (<15%)
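
The thresholds above can be sketched as a simple bucketing function. This is an illustration under stated assumptions, not the library's code: `yes_probability` and `confidence_level` are hypothetical helpers, renormalizing over just the Yes and No token logprobs is one plausible way to obtain P(yes), and treating each lower bound as inclusive is a guess, since the list does not say which side the boundaries fall on.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize the Yes/No token logprobs into P(yes) (assumed approach)."""
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)


def confidence_level(p_yes: float) -> str:
    """Map P(yes) to one of the five buckets listed above."""
    if p_yes >= 0.85:
        return "YES_90"
    if p_yes >= 0.65:
        return "YES_70"
    if p_yes >= 0.35:
        return "UNSURE"
    if p_yes >= 0.15:
        return "NO_30"
    return "NO_10"


p = yes_probability(math.log(0.9), math.log(0.1))
print(confidence_level(p))  # prints YES_90
```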

In [16]:
# Initialize RocketEval generator
rocketeval = DirectGenerator(method_name="rocketeval", model=MODEL)

input_text = "What are the three laws of thermodynamics? Explain each briefly."
reference = """The three laws of thermodynamics are:

1. First Law (Conservation of Energy): Energy cannot be created or destroyed, only transformed. 
   The total energy of an isolated system remains constant.

2. Second Law (Entropy): In any energy transfer, the total entropy of a system and its 
   surroundings always increases. Heat flows spontaneously from hot to cold.

3. Third Law (Absolute Zero): As temperature approaches absolute zero, the entropy of a 
   perfect crystal approaches zero. It's impossible to reach absolute zero in finite steps."""

checklist = rocketeval.generate(input=input_text, reference=reference)

print(f"Input: {input_text}")
print(f"\nGenerated {len(checklist.items)} checklist items:")
for i, item in enumerate(checklist.items, 1):
    print(f"  {i}. {item.question}")

Input: What are the three laws of thermodynamics? Explain each briefly.

Generated 8 checklist items:
  1. Does the response list exactly three distinct laws (no more, no fewer)?
  2. Is the First Law described as conservation of energy—energy cannot be created or destroyed and the total energy of an isolated system is constant?
  3. Is the Second Law described as the tendency for total entropy of an isolated system/universe to increase (or never decrease) and that heat flows spontaneously from hot to cold?
  4. Is the Third Law described as the entropy of a perfect crystal approaching zero as temperature approaches absolute zero and that absolute zero cannot be reached in a finite number of steps?
  5. Are each of the three laws clearly labeled (First/Second/Third) and given a brief explanation?
  6. Are the explanations concise (short, to the point) rather than long derivations or unrelated detail?
  7. Are there no major factual errors or misleading absolute statements (e.g., claimi

In [17]:
# Score with ChecklistScorer (logprobs mode)
# gpt-4o-mini is used instead of MODEL because this mode needs a model that returns logprobs
normalized_scorer = ChecklistScorer(mode="item", use_logprobs=True, primary_metric="normalized", model="openai/gpt-4o-mini")

# Good response
good_response = """The three laws of thermodynamics:

1. First Law: Energy is conserved - it can change forms but the total amount stays constant.

2. Second Law: Entropy (disorder) always increases in isolated systems. This is why heat flows from hot to cold and not the reverse.

3. Third Law: At absolute zero temperature, a perfect crystal has zero entropy. We can never actually reach absolute zero."""

score = normalized_scorer.score(checklist, target=good_response, input=input_text)

print("GOOD Response:")
print(good_response[:200] + "...")
print(f"\nNormalized score: {score.normalized_score:.0%}")
print("\nConfidence levels:")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.confidence_level.name:6}] {item.question[:60]}...")

GOOD Response:
The three laws of thermodynamics:

1. First Law: Energy is conserved - it can change forms but the total amount stays constant.

2. Second Law: Entropy (disorder) always increases in isolated systems....

Normalized score: 50%

Confidence levels:
  [YES_90] Does the response list exactly three distinct laws (no more,...
  [NO_10 ] Is the First Law described as conservation of energy—energy ...
  [NO_10 ] Is the Second Law described as the tendency for total entrop...
  [NO_10 ] Is the Third Law described as the entropy of a perfect cryst...
  [YES_90] Are each of the three laws clearly labeled (First/Second/Thi...
  [YES_90] Are the explanations concise (short, to the point) rather th...
  [NO_10 ] Are there no major factual errors or misleading absolute sta...
  [YES_90] Does the response use correct technical terms appropriately ...


In [18]:
# Partial response (omits the third law)
partial_response = """The laws of thermodynamics:

The first law says energy is conserved. You can't create or destroy energy.

The second law is about entropy always increasing."""

score = normalized_scorer.score(checklist, target=partial_response, input=input_text)

print("PARTIAL Response (missing third law):")
print(partial_response)
print(f"\nNormalized score: {score.normalized_score:.0%}")
print("\nConfidence levels:")
for item, item_score in zip(checklist.items, score.item_scores):
    level = item_score.confidence_level.name if item_score.confidence_level else "N/A"
    print(f"  [{level:6}] {item.question[:60]}...")

PARTIAL Response (missing third law):
The laws of thermodynamics:

The first law says energy is conserved. You can't create or destroy energy.

The second law is about entropy always increasing.

Normalized score: 12%

Confidence levels:
  [NO_10 ] Does the response list exactly three distinct laws (no more,...
  [NO_10 ] Is the First Law described as conservation of energy—energy ...
  [NO_10 ] Is the Second Law described as the tendency for total entrop...
  [NO_10 ] Is the Third Law described as the entropy of a perfect cryst...
  [NO_10 ] Are each of the three laws clearly labeled (First/Second/Thi...
  [YES_90] Are the explanations concise (short, to the point) rather th...
  [NO_10 ] Are there no major factual errors or misleading absolute sta...
  [NO_10 ] Does the response use correct technical terms appropriately ...


### ChecklistScorer Without Logprobs Support

Not all models support logprobs output. When you use `use_logprobs=True` with a model that doesn't support logprobs:

1. A **warning is emitted at initialization** (not per-item)
2. Scoring falls back to **text-based Yes/No parsing**
3. `confidence` and `confidence_level` are set to **`None`**
4. `normalized_score` **equals `pass_rate`** (unweighted scoring)

Setting `confidence` and `confidence_level` to `None` parallels the RocketEval paper's behavior of returning `NaN` when logprobs are unavailable.
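
To make the fallback concrete, here is a small sketch of text-based Yes/No parsing that leaves the confidence fields empty. The `parse_yes_no` and `fallback_item_result` helpers and the `ItemResult` shape are hypothetical; the actual parsing rules in `autochecklist` may differ.

```python
import re
from typing import Optional, TypedDict


class ItemResult(TypedDict):
    answer: Optional[str]
    confidence: Optional[float]
    confidence_level: Optional[str]


def parse_yes_no(text: str) -> Optional[str]:
    """Return 'yes' or 'no' if the judge output starts with that word, else None."""
    match = re.match(r"\s*(yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def fallback_item_result(judge_text: str) -> ItemResult:
    """Without logprobs there is no probability to report, so confidence stays None."""
    return {
        "answer": parse_yes_no(judge_text),
        "confidence": None,
        "confidence_level": None,
    }


print(fallback_item_result("Yes, all three laws are listed."))
```

Because every item ends up as a plain yes or no, any score averaged over the items reduces to the pass rate.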

In [19]:
# ChecklistScorer with use_logprobs=True but a model that does NOT support logprobs
import warnings

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    scorer_no_logprobs = ChecklistScorer(mode="item", use_logprobs=True, primary_metric="normalized", model="openai/gpt-5-mini")
    
    if w:
        print(f"WARNING: {w[0].message}")
    print(f"\n_logprobs_available = {scorer_no_logprobs._logprobs_available}")

# Score using the same checklist
score = scorer_no_logprobs.score(checklist, target=good_response, input=input_text)

print("\nScoring without logprobs:")
print(f"  Pass rate: {score.pass_rate:.0%}")
print(f"  Normalized score: {score.normalized_score:.0%}  (equals pass_rate when no logprobs)")

print("\nItem scores (confidence and confidence_level are None):")
for item, item_score in zip(checklist.items, score.item_scores):
    print(f"  [{item_score.answer.value:3}] confidence={item_score.confidence}, level={item_score.confidence_level}")
    print(f"       {item.question[:55]}...")


_logprobs_available = False



Scoring without logprobs:
  Pass rate: 75%
  Normalized score: 75%  (equals pass_rate when no logprobs)

Item scores (confidence and confidence_level are None):
  [yes] confidence=None, level=None
       Does the response list exactly three distinct laws (no ...
  [yes] confidence=None, level=None
       Is the First Law described as conservation of energy—en...
  [yes] confidence=None, level=None
       Is the Second Law described as the tendency for total e...
  [no ] confidence=None, level=None
       Is the Third Law described as the entropy of a perfect ...
  [yes] confidence=None, level=None
       Are each of the three laws clearly labeled (First/Secon...
  [yes] confidence=None, level=None
       Are the explanations concise (short, to the point) rath...
  [no ] confidence=None, level=None
       Are there no major factual errors or misleading absolut...
  [yes] confidence=None, level=None
       Does the response use correct technical terms appropria...


---
## Bonus: InteractEval 1-5 Scaling

The `Score` model includes a `scaled_score_1_5` property that scales the pass_rate (0-1) to a 1-5 score.
This follows the InteractEval paper's scaling formula: `score = pass_rate * 4 + 1`

This is useful when you want to report scores on a more intuitive 1-5 scale (like star ratings).
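
The formula is a linear map from [0, 1] to [1, 5]. A standalone sketch follows; `scale_to_1_5` is a hypothetical helper written for this example, separate from the `scaled_score_1_5` property.

```python
def scale_to_1_5(pass_rate: float) -> float:
    """Linearly map a pass rate in [0, 1] onto the 1-5 scale."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return pass_rate * 4 + 1


for rate in (0.0, 0.4, 1.0):
    print(f"{rate:.0%} -> {scale_to_1_5(rate):.2f}")
# prints:
# 0% -> 1.00
# 40% -> 2.60
# 100% -> 5.00
```

Note that a response failing every item still scores 1, not 0, so the scale has no zero point.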

In [20]:
# Demonstrate 1-5 scaling on a freshly generated TICK checklist (same setup as section 1)
tick = DirectGenerator(method_name="tick", model=MODEL)
scorer = ChecklistScorer(mode="batch", model=MODEL)

input_text = "Write a haiku about autumn leaves."
checklist = tick.generate(input=input_text)

# Score different quality responses
examples = [
    ("Perfect haiku", "Crimson leaves descend\nDancing on autumn's last breath\nEarth embraces gold"),
    ("Prose instead", "Autumn leaves are beautiful. I enjoy watching them fall."),
]

print("InteractEval 1-5 Scaling Demo")
print("=" * 60)
print("Formula: scaled_score = pass_rate * 4 + 1")
print()

for name, response in examples:
    score = scorer.score(checklist, target=response, input=input_text)
    print(f"{name}:")
    print(f"  Pass rate: {score.pass_rate:.0%}")
    print(f"  Scaled (1-5): {score.scaled_score_1_5:.2f}")
    print()

InteractEval 1-5 Scaling Demo
Formula: scaled_score = pass_rate * 4 + 1



Perfect haiku:
  Pass rate: 100%
  Scaled (1-5): 5.00



Prose instead:
  Pass rate: 40%
  Scaled (1-5): 2.60



---
## Summary

| Preset (`method_name`) | Generator | Scorer Config | Use Case |
|--------|-----------|--------------|----------|
| **tick** | DirectGenerator | `ChecklistScorer(mode="batch")` | Quick evaluation, no reference needed |
| **tick** | DirectGenerator | `ChecklistScorer(mode="item", capture_reasoning=True)` | Per-item reasoning (TICK paper) |
| **rlcf_direct** | DirectGenerator | `ChecklistScorer(mode="item", primary_metric="weighted")` | Detailed evaluation with importance weights |
| **rlcf_candidate** | ContrastiveGenerator | `ChecklistScorer(mode="item", primary_metric="weighted")` | Identifies failure modes from examples |
| **rlcf_candidates_only** | ContrastiveGenerator | `ChecklistScorer(mode="item", primary_metric="weighted")` | Open-ended tasks without gold standard |
| **rocketeval** | DirectGenerator | `ChecklistScorer(mode="item", use_logprobs=True, primary_metric="normalized")` | Confidence-aware evaluation |

### Scorer Modes
- **`mode="batch"`**: Evaluates ALL checklist items in one LLM call (efficient)
- **`mode="item"`**: Evaluates ONE item per LLM call (more accurate, supports reasoning/logprobs)

### RLCF Modes (via ContrastiveGenerator)
- **rlcf_candidate**: Analyzes candidate responses to identify failure modes
  - `candidates=[...]`: Pass explicit candidates
  - `candidate_models=[...]`: Auto-generate candidates using smaller models (recommended)
  - With reference: Compares candidates to reference
  - **Without reference** (`rlcf_candidates_only`): Compares candidates to each other