# Module 0.1: Taste Demo — Salience Scoring — Core

**Arc 0: Probabilistic Foundations** | Module 1 of 8

**Prerequisites**: None

**Time**: ~60-90 minutes

**Implementation target**: buildlog `SalienceScorer` — replaces substring matching for rule compliance evaluation

---

## Learning Objectives

By the end of this notebook, you will be able to:

- [ ] Explain why substring matching fails for evaluating agent rule compliance
- [ ] Decompose rule compliance into linguistic, structural, and outcome signals
- [ ] Implement a `SalienceScorer` with configurable, updatable weights
- [ ] State the falsifiable claim for your scorer and test it against intuitive ratings

In [None]:
# Provided Code - Do NOT Edit
import re
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import Callable
from scipy import stats as scipy_stats

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

# ═══════════════════════════════════════════════════════════════════════════════
# INTRO
# ═══════════════════════════════════════════════════════════════════════════════

## The Setup

While ..., you had one agent, one rule, and
one evaluation function. The rule was simple: *"Always define interfaces before
implementations."* The agent followed it perfectly: it wrote a clean abstract
base class (an `ABC`), then a concrete class implementing it.

The evaluation function ran `if rule_text in agent_output`, got `False` (the
agent didn't *quote* the rule, it *followed* the rule), and logged a negative
reward.

The agent was punished for doing its job.

**By the end of this notebook**, you'll have a working `SalienceScorer` that
evaluates rule compliance through three signals — and can explain its scores
in plain English.

Let's see exactly what happened.

In [None]:
# Here's the rule, here's what the agent produced, and here's what the check said.

one_rule = "Always define interfaces before implementations"

one_output = """I'll start with the interface:
```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def save(self, key: str, data: dict) -> None: ...
    @abstractmethod
    def load(self, key: str) -> dict: ...

class PostgresBackend(StorageBackend):
    def save(self, key, data):
        self.conn.execute("INSERT INTO profiles ...", data)
    def load(self, key):
        return self.conn.execute("SELECT ...", key).fetchone()
```"""

def contains_check(rule: str, output: str) -> float:
    """The current approach: does the rule text appear in the output?"""
    return 1.0 if rule.lower() in output.lower() else 0.0

score = contains_check(one_rule, one_output)
print(f"Rule:  {one_rule}")
print(f"Score: {score}")
print()
print("The agent wrote an abstract base class, then a concrete class. Perfect compliance.")
print(f"The check returned {score}. It looked for the literal string and didn't find it.")

The agent followed the rule. The check said it didn't.

How would *you* evaluate this? Read that agent output again. What tells you
the rule was followed?

## Building Intuition

What words jump out from the agent output? `ABC`, `abstractmethod`,
`StorageBackend`: these are the *vocabulary* of the rule "define interfaces
before implementations." The agent didn't quote the rule; it spoke its language.

What if we could check for that? Not the exact rule text, but the rule's
*vocabulary* showing up in the output.

Let's try it on this one example first.

In [None]:
# Provided Code - Do NOT Edit
SYNONYM_MAP = {
    "interface": ["interface", "protocol", "abc", "abstract"],
    "interfaces": ["interface", "protocol", "abc", "abstract"],
    "implementations": ["implementation", "concrete", "class"],
    "define": ["define", "create", "class"],
    "validate": ["validate", "check", "verify", "assert", "raise"],
    "parsed": ["parsed", "parse", "fromisoformat", "strptime"],
    "dates": ["date", "datetime", "timestamp"],
    "valid": ["valid", "range", "between", "boundary"],
    "ranges": ["range", "between", "min", "max", "limit"],
    "always": ["always"],
    "within": ["within", "inside", "between"],
    "before": ["before", "first", "prior"],
}


def extract_code_blocks(text: str) -> tuple[str, str]:
    """Split text into (code, prose) by extracting ```...``` blocks."""
    code_blocks = re.findall(r'```(?:python)?\n(.*?)```', text, re.DOTALL)
    code = '\n'.join(code_blocks)
    prose = re.sub(r'```(?:python)?\n.*?```', '', text, flags=re.DOTALL)
    return code, prose

# -------------------------------------------------------------------------------
# Problem 2: Score Vocabulary Overlap Between Rule and Output
# -------------------------------------------------------------------------------

The linguistic signal asks: does the output use vocabulary associated with the rule?

Your task:
1. Implement `linguistic_signal(rule, output)` returning float in [0, 1]
2. Extract keywords from the rule (words longer than 3 characters)
3. Check for each keyword (or synonyms) in the output
4. Weight code occurrences at 1.0, prose at 0.5

You'll need `extract_code_blocks` (provided above) to separate code from prose.
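Before implementing the signal, it may help to see what the code/prose split looks like on a tiny input. The sketch below repeats the provided `extract_code_blocks` helper so it runs on its own; the `sample` string and the `FENCE` constant are made up for illustration (the fence is built programmatically only so the example can be shown inside a fenced block).

```python
import re

FENCE = "`" * 3  # three backticks, assembled at runtime for display purposes


def extract_code_blocks(text: str) -> tuple[str, str]:
    """Same logic as the provided helper: split text into (code, prose)."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    code = "\n".join(re.findall(pattern, text, re.DOTALL))
    prose = re.sub(pattern, "", text, flags=re.DOTALL)
    return code, prose


# Hypothetical agent output: one prose line, one fenced block, one closing word.
sample = (
    "I'll define the protocol first:\n"
    f"{FENCE}python\nclass Store(ABC): ...\n{FENCE}\n"
    "Done."
)
code, prose = extract_code_blocks(sample)

print(repr(code))   # only the contents of the fenced block
print(repr(prose))  # everything outside the fence
# "abc" shows up in the code part, "protocol" only in the prose part
print("abc" in code.lower(), "protocol" in prose.lower())
```

Under the weighting in step 4, a synonym found in `code` would count 1.0 for its keyword, while one found only in `prose` would count 0.5.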

In [None]:
def linguistic_signal(rule: str, output: str) -> float:
    """
    Measure how much of the rule's key vocabulary appears in the output.
    Code mentions weighted 1.0, prose mentions weighted 0.5.
    Returns float in [0, 1].
    """
    # --- YOUR CODE BELOW ---
    pass


# >>> SOLUTION (collapsed by default)
# ┌─────────────────────────────────────────────────────────────────────────────
# │ def linguistic_signal(rule: str, output: str) -> float:
# │     keywords = [w.lower() for w in rule.split() if len(w) > 3]
# │     if not keywords:
# │         return 0.0
# │     code, prose = extract_code_blocks(output)
# │     code_lower, prose_lower = code.lower(), prose.lower()
# │     scores = []
# │     for kw in keywords:
# │         synonyms = SYNONYM_MAP.get(kw, [kw])
# │         in_code = any(s in code_lower for s in synonyms)
# │         in_prose = any(s in prose_lower for s in synonyms)
# │         if in_code:
# │             scores.append(1.0)
# │         elif in_prose:
# │             scores.append(0.5)
# │         else:
# │             scores.append(0.0)
# │     return min(1.0, sum(scores) / len(scores))
# └─────────────────────────────────────────────────────────────────────────────

In [None]:
# Verify on our single example first.
_l_single = linguistic_signal(one_rule, one_output)
assert _l_single is not None, "Did you implement linguistic_signal? It returned None."
assert isinstance(_l_single, float), f"Expected float, got {type(_l_single)}"
assert _l_single > 0.5, f"This example has perfect compliance -- should score > 0.5, got {_l_single:.2f}"

print(f"Linguistic signal on our single example: {_l_single:.2f}")
print(f"Contains check on the same example:      {contains_check(one_rule, one_output):.2f}")
print()
print("One example works. But what about edge cases?")

Vocabulary matching catches what `contains` missed. But one example doesn't
tell us much. What about these cases?

- What if the rule *doesn't apply* to the task? (No interfaces needed)
- What if the agent *quotes* the rule but doesn't *follow* it?
- What if the code looks right but *crashes*?
- What about a completely different rule — like date validation?

We need a proper test suite.

# -------------------------------------------------------------------------------
# Problem 1: Build the Test Suite
# -------------------------------------------------------------------------------

Now that you've felt the problem on one example, let's build the full dataset.
Each entry has a rule, agent output, task outcome, and your intuitive compliance
rating (0-1) as ground truth.

We need 10 entries covering:
- Perfect compliance
- Rule doesn't apply (irrelevance)
- Rule violated
- False positive (agent quotes rule without following it)
- Failed tasks

In [None]:
@dataclass
class BuildlogEntry:
    """A single buildlog entry with a rule, agent output, and ground truth."""
    rule: str
    task: str
    agent_output: str
    task_succeeded: bool
    intuitive_compliance: float  # Your expert rating, 0-1


ENTRIES = [
    # Entry 0: Perfect compliance -- interface defined before implementation
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Create a storage backend for user profiles",
        agent_output="""I'll start with the interface:
```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def save(self, key: str, data: dict) -> None: ...
    @abstractmethod
    def load(self, key: str) -> dict: ...

class PostgresBackend(StorageBackend):
    def save(self, key, data):
        self.conn.execute("INSERT INTO profiles ...", data)
    def load(self, key):
        return self.conn.execute("SELECT ...", key).fetchone()
```""",
        task_succeeded=True,
        intuitive_compliance=0.95,
    ),
    # Entry 1: Rule doesn't apply -- no interface needed
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Write a utility function to slugify strings",
        agent_output="""```python
import re

def slugify(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r'[^\\w\\s-]', '', text)
    return re.sub(r'[-\\s]+', '-', text)
```
Simple utility -- no interface needed here.""",
        task_succeeded=True,
        intuitive_compliance=0.5,
    ),
    # Entry 2: Rule violated -- no interface, but one was needed
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Create a payment processor with multiple gateways",
        agent_output="""```python
class StripeProcessor:
    def charge(self, amount): ...

class PayPalProcessor:
    def charge(self, amount): ...
```
Both processors handle charges directly.""",
        task_succeeded=True,
        intuitive_compliance=0.15,
    ),
    # Entry 3: FALSE POSITIVE -- agent quotes the rule but doesn't follow it
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Build a notification service with email and SMS",
        agent_output="""Per the rule 'always define interfaces before implementations',
I'll build the notification service:
```python
class EmailNotifier:
    def send(self, to, msg): ...

class SMSNotifier:
    def send(self, to, msg): ...
```
Both handle notifications.""",
        task_succeeded=True,
        intuitive_compliance=0.10,  # Quoted the rule but didn't follow it!
    ),
    # Entry 4: Perfect date validation
    BuildlogEntry(
        rule="Always validate parsed dates are within valid ranges",
        task="Parse user-submitted event dates",
        agent_output="""```python
from datetime import datetime

def parse_event_date(raw: str) -> datetime:
    dt = datetime.fromisoformat(raw)
    if dt.year < 2020 or dt.year > 2030:
        raise ValueError(f"Date {dt} outside valid range 2020-2030")
    return dt
```""",
        task_succeeded=True,
        intuitive_compliance=0.90,
    ),
    # Entry 5: Date parsing without validation
    BuildlogEntry(
        rule="Always validate parsed dates are within valid ranges",
        task="Import CSV with timestamps",
        agent_output="""```python
import csv
from datetime import datetime

def import_csv(path):
    with open(path) as f:
        for row in csv.reader(f):
            ts = datetime.fromisoformat(row[3])
            yield {"name": row[0], "timestamp": ts}
```""",
        task_succeeded=True,
        intuitive_compliance=0.20,
    ),
    # Entry 6: No dates involved at all
    BuildlogEntry(
        rule="Always validate parsed dates are within valid ranges",
        task="Implement retry logic for HTTP requests",
        agent_output="""```python
import time
import httpx

def retry(url, max_retries=3):
    for i in range(max_retries):
        try:
            return httpx.get(url)
        except httpx.TimeoutException:
            time.sleep(2 ** i)
    raise RuntimeError(f"Failed after {max_retries} retries")
```""",
        task_succeeded=True,
        intuitive_compliance=0.5,
    ),
    # Entry 7: FAILED TASK -- interface defined but code is broken
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Create a caching layer with Redis and in-memory backends",
        agent_output="""```python
from abc import ABC, abstractmethod

class CacheBackend(ABC):
    @abstractmethod
    def get(self, key: str) -> str: ...

class RedisCache(CacheBackend):
    def get(self, key):
        return self.client.get(key)  # self.client is never initialized
```
The tests fail -- RedisCache crashes on instantiation.""",
        task_succeeded=False,
        intuitive_compliance=0.35,  # Good structure but broken code
    ),
    # Entry 8: FAILED TASK -- date validation present but wrong logic
    BuildlogEntry(
        rule="Always validate parsed dates are within valid ranges",
        task="Build a booking system date parser",
        agent_output="""```python
from datetime import datetime

def parse_booking_date(raw: str) -> datetime:
    dt = datetime.fromisoformat(raw)
    if dt.year < 2020:
        raise ValueError("Too old")
    # BUG: no upper bound check, accepts year 9999
    return dt
```
Booking system crashed on a test with year 2099.""",
        task_succeeded=False,
        intuitive_compliance=0.30,  # Partial validation, task failed
    ),
    # Entry 9: Perfect compliance on interface rule, complex case
    BuildlogEntry(
        rule="Always define interfaces before implementations",
        task="Design a plugin system for buildlog extractors",
        agent_output="""I'll define the extractor protocol first:
```python
from typing import Protocol

class ExtractorProtocol(Protocol):
    def extract(self, text: str) -> list[str]: ...
    def confidence(self) -> float: ...

class RegexExtractor:
    def extract(self, text):
        return re.findall(r'RULE: (.+)', text)
    def confidence(self):
        return 0.6

class LLMExtractor:
    def extract(self, text):
        return self.client.extract_rules(text)
    def confidence(self):
        return 0.8
```""",
        task_succeeded=True,
        intuitive_compliance=0.95,
    ),
]

print(f"Loaded {len(ENTRIES)} mock buildlog entries.")
print(f"Rules: {set(e.rule for e in ENTRIES)}")
print(f"Failed tasks: {sum(1 for e in ENTRIES if not e.task_succeeded)}")

In [None]:
# Run contains_check AND linguistic_signal on all 10, side by side.

print("Contains vs Linguistic vs Intuitive:\n")
print(f"{'Entry':>7}  {'Contains':>9}  {'Linguistic':>11}  {'Intuitive':>10}")
print("  " + "-" * 55)
for i, e in enumerate(ENTRIES):
    c = contains_check(e.rule, e.agent_output)
    l = linguistic_signal(e.rule, e.agent_output)
    marker = ""
    if i == 3:
        marker = "  <<< false positive (quoted rule, didn't follow)"
    print(f"  Entry {i}:  {c:>8.2f}  {l:>10.2f}  {e.intuitive_compliance:>10.2f}{marker}")

print()
print("Linguistic signal tracks intuition better than contains on most entries.")
print("But look at Entry 3 -- it scores HIGH linguistically because the")
print("agent mentioned interface-related words. It just didn't build one.")

Words aren't enough. Entry 3 mentions interfaces without building one. The
agent *talked about* the rule instead of *following* it, and vocabulary matching
can't tell the difference.

We need to check *structure* — did the output actually contain the pattern
the rule prescribes?

# -------------------------------------------------------------------------------
# Problem 3: Structural Signal Detection
# -------------------------------------------------------------------------------

The structural signal asks: does the output's *structure* match the rule's
prescribed pattern? This is rule-specific.

Your task:
1. Implement `check_interface_before_impl(output)` — score in [0, 1]
2. Implement `check_date_validation(output)` — score in [0, 1]
3. Wire them up via `structural_signal(rule, output)`
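One way to think about "interface before implementation" is as an ordering question over class definitions. The sketch below is an alternative to a regex check, not the notebook's solution: it uses the standard `ast` module, so it only works when the extracted code is valid Python. The helper name `interface_first_ast` and the `INTERFACE_BASES` set are assumptions made for this example.

```python
import ast

INTERFACE_BASES = {"ABC", "Protocol"}  # assumption: only these base names count


def interface_first_ast(code: str) -> bool:
    """Hypothetical helper: True if an interface class precedes every concrete class."""
    tree = ast.parse(code)
    # Top-level class definitions, in source order.
    classes = [node for node in tree.body if isinstance(node, ast.ClassDef)]

    def is_interface(cls: ast.ClassDef) -> bool:
        return any(
            isinstance(base, ast.Name) and base.id in INTERFACE_BASES
            for base in cls.bases
        )

    iface_idx = [i for i, c in enumerate(classes) if is_interface(c)]
    concrete_idx = [i for i, c in enumerate(classes) if not is_interface(c)]
    if not iface_idx:
        return False  # no interface at all
    # An interface with no concrete class after it still counts as interface-first.
    return not concrete_idx or iface_idx[0] < concrete_idx[0]


good = "class Backend(ABC):\n    def save(self): ...\n\nclass Pg(Backend):\n    def save(self): ...\n"
bad = "class Pg:\n    def save(self): ...\n\nclass Backend(ABC):\n    def save(self): ...\n"
print(interface_first_ast(good), interface_first_ast(bad))
```

A boolean like this would still need mapping onto the 0.0 to 1.0 scale described in the docstrings, and it cannot produce the 0.25 "acknowledged" case, which depends on prose rather than code.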

In [None]:
def check_interface_before_impl(output: str) -> float:
    """
    1.0 = clear interface-first pattern
    0.5 = interface exists but order unclear
    0.25 = no interface but agent acknowledged rule doesn't apply
    0.0 = no interface pattern detected
    """
    # --- YOUR CODE BELOW ---
    pass


def check_date_validation(output: str) -> float:
    """
    1.0 = date parsing with explicit range validation
    0.5 = some validation but incomplete
    0.25 = dates present but no validation
    0.0 = no date handling detected
    """
    # --- YOUR CODE BELOW ---
    pass


# >>> SOLUTION (collapsed by default)
# ┌─────────────────────────────────────────────────────────────────────────────
# │ def check_interface_before_impl(output: str) -> float:
# │     code, prose = extract_code_blocks(output)
# │     iface_pattern = re.compile(
# │         r'class\s+\w+\(\s*(?:ABC|Protocol)\s*\)|@abstractmethod'
# │     )
# │     has_abstract = bool(iface_pattern.search(code))
# │     has_concrete = bool(re.search(r'class\s+\w+', code))
# │     if has_abstract and has_concrete:
# │         abstract_match = iface_pattern.search(code)
# │         for m in re.finditer(r'class\s+\w+', code):
# │             if m.start() > abstract_match.start():
# │                 return 1.0
# │         return 0.5
# │     acknowledged = bool(re.search(
# │         r'(no.*interface.*needed|utility|simple|no interface needed)',
# │         prose, re.I
# │     ))
# │     if acknowledged:
# │         return 0.25
# │     return 0.0
# │
# │ def check_date_validation(output: str) -> float:
# │     code, _ = extract_code_blocks(output)
# │     has_date = bool(re.search(r'(datetime|fromisoformat|strptime)', code, re.I))
# │     # Both bounds must sit in a single `if` line. A bare `raise ValueError`
# │     # only shows that some check exists, which has_any_check covers below.
# │     has_upper_and_lower = bool(re.search(
# │         r'if.*(<|>).*\d.*(<|>)', code
# │     ))
# │     has_any_check = bool(re.search(
# │         r'(if.*(<|>|<=|>=).*\d|raise.*ValueError)', code, re.I
# │     ))
# │     if has_date and has_upper_and_lower:
# │         return 1.0
# │     if has_date and has_any_check:
# │         return 0.5
# │     if has_date:
# │         return 0.25
# │     return 0.0
# └─────────────────────────────────────────────────────────────────────────────


STRUCTURAL_PATTERNS: dict[str, Callable[[str], float]] = {
    "Always define interfaces before implementations": check_interface_before_impl,
    "Always validate parsed dates are within valid ranges": check_date_validation,
}


def structural_signal(rule: str, output: str) -> float:
    """Dispatch to the appropriate structural checker."""
    checker = STRUCTURAL_PATTERNS.get(rule)
    return checker(output) if checker else 0.5

In [None]:
# Verify: Entry 0 (perfect interface) should score high,
# Entry 3 (quoted rule, no interface) should score low.
_s0 = structural_signal(ENTRIES[0].rule, ENTRIES[0].agent_output)
_s3 = structural_signal(ENTRIES[3].rule, ENTRIES[3].agent_output)
assert _s0 is not None, "Did you implement the structural checkers? Got None."
assert _s0 >= 0.75, f"Entry 0 (perfect interface) should score >= 0.75, got {_s0:.2f}"
assert _s3 < 0.5, f"Entry 3 (false positive) should score < 0.5, got {_s3:.2f}"

print("Structural signal scores:")
for i, e in enumerate(ENTRIES):
    score = structural_signal(e.rule, e.agent_output)
    print(f"  Entry {i}: {score:.2f}  (intuitive: {e.intuitive_compliance:.2f})")

print()
print("Entry 3 now caught: high linguistic, low structural. False positive detected.")

# Tests pass. Moving on.

Two signals down. Linguistic catches vocabulary, structural catches patterns,
and together they caught the false positive.

But what about Entries 7 and 8? Entry 7 has perfect structure (interface before
implementation) but the code crashes. Entry 8 has partial date validation but
the booking system failed. The code *looks* right but *doesn't work*.

We need a third signal.

# -------------------------------------------------------------------------------
# Problem 4: Outcome Signal
# -------------------------------------------------------------------------------

The simplest signal: did the task succeed? This is binary for now, but it
matters. Entries 7 and 8 have `task_succeeded=False` — no matter how good the
structure looks, a crashing implementation isn't compliance.

In [None]:
def outcome_signal(entry: BuildlogEntry) -> float:
    """1.0 if task succeeded, 0.0 otherwise."""
    # --- YOUR CODE BELOW ---
    pass


# >>> SOLUTION (collapsed by default)
# ┌─────────────────────────────────────────────────────────────────────────────
# │ def outcome_signal(entry: BuildlogEntry) -> float:
# │     return 1.0 if entry.task_succeeded else 0.0
# └─────────────────────────────────────────────────────────────────────────────

# Verify
_o0 = outcome_signal(ENTRIES[0])
_o7 = outcome_signal(ENTRIES[7])
assert _o0 is not None, "Did you implement outcome_signal? Got None."
assert _o0 == 1.0, f"Entry 0 (succeeded) should be 1.0, got {_o0}"
assert _o7 == 0.0, f"Entry 7 (failed) should be 0.0, got {_o7}"

print("Outcome signal: 8 succeeded, 2 failed.")
print("Entries 7 and 8 will get penalized regardless of linguistic/structural scores.")

# Tests pass. Moving on.

Three signals. Three numbers per entry. To collapse them into one score, you
need a weighted average — the same thing you've computed since GPA. Give each
signal a weight, multiply, sum.

That's the whole idea. Let's package it.

# -------------------------------------------------------------------------------
# Problem 5: Assemble the SalienceScorer
# -------------------------------------------------------------------------------

Package `linguistic_signal`, `structural_signal`, and `outcome_signal` from
Problems 2-4 into a single class.

The formula is just what we described: `S = w_l * linguistic + w_s * structural + w_o * outcome`
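As a quick arithmetic check of the formula, here is a minimal sketch using the default weights from this problem (0.4 / 0.4 / 0.2). The signal values and the function name `combine` are illustrative assumptions, not results from the notebook.

```python
def combine(linguistic: float, structural: float, outcome: float,
            w_l: float = 0.4, w_s: float = 0.4, w_o: float = 0.2) -> float:
    """Weighted linear combination of the three signals."""
    return w_l * linguistic + w_s * structural + w_o * outcome


# Same hypothetical linguistic and structural signals, differing only in outcome.
succeeded = combine(0.6, 1.0, 1.0)  # 0.24 + 0.40 + 0.20
failed = combine(0.6, 1.0, 0.0)     # 0.24 + 0.40 + 0.00
print(f"{succeeded:.2f} {failed:.2f}")
```

With these weights, a task that fails loses exactly `w_o = 0.2` relative to the same output on a task that succeeds, which is worth keeping in mind when judging whether the outcome weight is large enough.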

Your task:
1. Implement `SalienceScorer` with configurable weights
2. Weights must be updatable (constitutional rule: not hardcoded forever)
3. `score` returns a `SalienceResult` with component breakdown
4. `explain` returns plain English (constitutional rule: you must be able to say what a score of 0.7 means)
5. State the **falsifiable claim**: what would make this scorer wrong?

In [None]:
@dataclass
class SalienceResult:
    """Result with component breakdown."""
    score: float
    linguistic: float
    structural: float
    outcome: float
    weights: dict

    def explain(self) -> str:
        """Plain-English explanation."""
        # --- YOUR CODE BELOW ---
        pass


class SalienceScorer:
    """
    Scores agent output for rule compliance using three signals.

    Falsifiable claim: Rankings agree with expert intuitive ratings
    (Spearman rho > 0.8). If not, recalibrate.
    """

    def __init__(self, w_linguistic=0.4, w_structural=0.4, w_outcome=0.2):
        # --- YOUR CODE BELOW ---
        pass

    def update_weights(self, w_l: float, w_s: float, w_o: float) -> None:
        """Update weights. Must sum to 1."""
        # --- YOUR CODE BELOW ---
        pass

    def score(self, entry: BuildlogEntry) -> SalienceResult:
        """Score a single buildlog entry."""
        # --- YOUR CODE BELOW ---
        pass


# >>> SOLUTION (collapsed by default)
# ┌─────────────────────────────────────────────────────────────────────────────
# │ @dataclass
# │ class SalienceResult:
# │     score: float
# │     linguistic: float
# │     structural: float
# │     outcome: float
# │     weights: dict
# │
# │     def explain(self) -> str:
# │         level = (
# │             "strong" if self.score >= 0.75
# │             else "moderate" if self.score >= 0.5
# │             else "weak" if self.score >= 0.25
# │             else "minimal"
# │         )
# │         parts = []
# │         if self.linguistic >= 0.7:
# │             parts.append("uses relevant vocabulary")
# │         elif self.linguistic >= 0.4:
# │             parts.append("some vocabulary overlap")
# │         else:
# │             parts.append("little vocabulary match")
# │         if self.structural >= 0.7:
# │             parts.append("follows the prescribed pattern")
# │         elif self.structural >= 0.4:
# │             parts.append("partially follows the pattern")
# │         else:
# │             parts.append("doesn't follow the pattern")
# │         parts.append("task succeeded" if self.outcome >= 0.5 else "task failed")
# │         return f"Score {self.score:.2f} -- {level} compliance. The output {', '.join(parts)}."
# │
# │ class SalienceScorer:
# │     FALSIFIABLE_CLAIM = "Rankings agree with expert ratings (Spearman rho > 0.8)."
# │
# │     def __init__(self, w_linguistic=0.4, w_structural=0.4, w_outcome=0.2):
# │         self.update_weights(w_linguistic, w_structural, w_outcome)
# │
# │     def update_weights(self, w_l, w_s, w_o):
# │         total = w_l + w_s + w_o
# │         assert abs(total - 1.0) < 1e-6, f"Weights must sum to 1, got {total}"
# │         self.weights = {"linguistic": w_l, "structural": w_s, "outcome": w_o}
# │
# │     def score(self, entry):
# │         l = linguistic_signal(entry.rule, entry.agent_output)
# │         s = structural_signal(entry.rule, entry.agent_output)
# │         o = outcome_signal(entry)
# │         combined = self.weights["linguistic"]*l + self.weights["structural"]*s + self.weights["outcome"]*o
# │         return SalienceResult(round(combined, 4), round(l, 4), round(s, 4), round(o, 4), dict(self.weights))
# └─────────────────────────────────────────────────────────────────────────────

In [None]:
# Verify: scorer should be constructible and produce results
scorer = SalienceScorer(w_linguistic=0.4, w_structural=0.4, w_outcome=0.2)
r0 = scorer.score(ENTRIES[0])
assert r0 is not None, "scorer.score() returned None. Did you implement it?"
assert hasattr(r0, 'explain'), "SalienceResult needs an explain() method."
assert r0.score > 0.5, f"Entry 0 (perfect compliance) should score > 0.5, got {r0.score}"
print(f"Entry 0 score: {r0.score:.2f}")
print(f"Explanation: {r0.explain()}")

# Tests pass. Moving on.

This is a *salience scorer*: a weighted linear combination of three signals.
Both terms are just names for what you already built: multiply each signal
by its weight, then sum the results.

> **Go deeper**: You just built a weighted linear combination. For the probability
> foundations underneath this: *Think Stats* Ch 1-2 (exploratory data analysis),
> *Think Bayes* Ch 1 (computational Bayesian thinking), Blitzstein & Hwang Ch 1-2
> (formal probability framework).

Now let's see if it actually works across all 10 entries.

# -------------------------------------------------------------------------------
# Problem 6: Run the Scorer, Visualize, Test the Falsifiable Claim
# -------------------------------------------------------------------------------

Now run the `scorer` from Problem 5 against all `ENTRIES` from Problem 1.

Your task:
1. Score all 10 entries
2. Create a side-by-side bar chart: salience score vs. intuitive rating
3. Compute Spearman rank correlation — does it beat 0.8?
4. Print the explanation for each entry
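Spearman correlation compares *rankings*, not raw values, which is why it suits a claim about agreeing with expert ordering. The sketch below uses the textbook shortcut formula, which is only valid when there are no tied values; the notebook itself uses `scipy_stats.spearmanr`, which handles ties (and the intuitive ratings in `ENTRIES` do contain ties). The function name and the three lists are hypothetical.

```python
def spearman_no_ties(xs: list[float], ys: list[float]) -> float:
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""

    def ranks(values: list[float]) -> list[int]:
        # Rank 1 goes to the smallest value, rank n to the largest.
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


scorer_scores = [0.84, 0.30, 0.55, 0.10]  # hypothetical scorer output
same_order = [0.95, 0.20, 0.50, 0.15]     # hypothetical ratings, identical ranking
one_swap = [0.95, 0.50, 0.20, 0.15]       # two adjacent ranks swapped

print(f"{spearman_no_ties(scorer_scores, same_order):.2f}")
print(f"{spearman_no_ties(scorer_scores, one_swap):.2f}")
```

The first pair differs in every raw value yet agrees perfectly in rank. With only four items, a single adjacent swap already brings rho down to the 0.8 threshold; with ten items the same swap costs far less, so the sample size matters when interpreting the result.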

In [None]:
# --- YOUR CODE BELOW ---
# Score all entries, plot comparison, compute Spearman correlation


# >>> SOLUTION (collapsed by default)
# ┌─────────────────────────────────────────────────────────────────────────────
# │ results = [scorer.score(e) for e in ENTRIES]
# │ salience_scores = [r.score for r in results]
# │ intuitive = [e.intuitive_compliance for e in ENTRIES]
# │
# │ rho, pval = scipy_stats.spearmanr(salience_scores, intuitive)
# │ print(f"Spearman rho: {rho:.3f}  (p={pval:.4f})")
# │ print(f"Falsifiable claim threshold: rho > 0.8")
# │ print(f"Result: {'PASS' if rho > 0.8 else 'NEEDS RECALIBRATION'}")
# │ print()
# │
# │ x = np.arange(len(ENTRIES))
# │ width = 0.35
# │ fig, ax = plt.subplots(figsize=(14, 5))
# │ bars1 = ax.bar(x - width/2, salience_scores, width, label='Salience Score', color='#2196F3')
# │ bars2 = ax.bar(x + width/2, intuitive, width, label='Intuitive Rating', color='#FF9800')
# │ # Mark failed tasks
# │ for i, e in enumerate(ENTRIES):
# │     if not e.task_succeeded:
# │         ax.annotate('FAILED', (i, max(salience_scores[i], intuitive[i]) + 0.05),
# │                     ha='center', fontsize=8, color='red')
# │ ax.set_xlabel('Entry')
# │ ax.set_ylabel('Score')
# │ ax.set_title('SalienceScorer vs. Intuitive Ratings')
# │ ax.set_xticks(x)
# │ ax.legend()
# │ ax.set_ylim(0, 1.2)
# │ plt.tight_layout()
# │ plt.show()
# │
# │ print("\nExplanations:")
# │ for i, (e, r) in enumerate(zip(ENTRIES, results)):
# │     status = "OK" if e.task_succeeded else "FAILED"
# │     print(f"\n  Entry {i} [{status}]: {e.task[:50]}")
# │     print(f"    {r.explain()}")
# │     print(f"    Components: L={r.linguistic:.2f} S={r.structural:.2f} O={r.outcome:.2f}")
# └─────────────────────────────────────────────────────────────────────────────

## Theory Backfill: Why a Weighted Linear Combination?

We made two assumptions that are both wrong and both useful:

1. **Linearity**: Signals combine additively. In reality, high structural + high
   linguistic is stronger evidence than either alone (interaction effects).
   But linear is the simplest baseline.

2. **Independence**: The three signals are independent. They're not — vocabulary
   overlap correlates with structural compliance. But treating them as independent
   lets us build and test each detector separately.

Both assumptions break down. That's fine. Module 0.2 builds the probability
foundations to handle more sophisticated models.
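The linearity assumption can be seen directly in a few lines. In the sketch below (function name and inputs are illustrative; weights are the defaults used earlier), the gain from moving the structural signal from 0 to 1 is the same whether or not the vocabulary is present. A linear model cannot express "both together are stronger evidence than the sum of each".

```python
def linear_score(ling: float, struct: float, outcome: float,
                 w_l: float = 0.4, w_s: float = 0.4, w_o: float = 0.2) -> float:
    """Weighted linear combination with the notebook's default weights."""
    return w_l * ling + w_s * struct + w_o * outcome


# Gain from structural 0 -> 1 when the linguistic signal is absent...
gain_when_vocab_absent = linear_score(0.0, 1.0, 1.0) - linear_score(0.0, 0.0, 1.0)
# ...and when it is fully present.
gain_when_vocab_present = linear_score(1.0, 1.0, 1.0) - linear_score(1.0, 0.0, 1.0)

print(f"{gain_when_vocab_absent:.2f} {gain_when_vocab_present:.2f}")
```

Both gains equal `w_s`. Capturing an interaction effect would require something beyond a plain weighted sum.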

# ═══════════════════════════════════════════════════════════════════════════════
# EXERCISES
# ═══════════════════════════════════════════════════════════════════════════════

## Exercise 1: Break the Scorer

Add 3 new `BuildlogEntry` instances to `ENTRIES` that expose failure modes of
the `SalienceScorer`. For each entry, explain:
- What the scorer gets wrong (predicted score vs. your intuitive rating)
- Why it fails (which signal is misleading)
- How you'd fix it (what signal or logic would handle this case)

**Success criteria**: At least one entry where the scorer's score differs from
your intuitive rating by more than 0.3.

## Exercise 2: Weight Tuning

Try different weight configurations on the 10 entries:
- `(0.7, 0.2, 0.1)` — linguistic-heavy
- `(0.2, 0.7, 0.1)` — structural-heavy
- `(0.33, 0.33, 0.34)` — uniform

**Success criteria**: Report the Spearman rho for each configuration. Identify
which gives the best correlation. If any configuration achieves rho > 0.9,
explain whether that's likely overfitting to 10 data points (hint: it probably is).

## Exercise 3 [PUBLISH]: Write the "Contains Check Takedown"

Write 500-800 words explaining why substring matching fails for evaluating
agent rule compliance. Target: practitioners building AI agent systems.

Structure:
1. The problem: you have rules, you need to evaluate compliance
2. The naive approach: `contains` / substring matching
3. Three failure modes (use examples from this notebook)
4. The alternative: decompose into linguistic + structural + outcome signals
5. Why this matters for bandit-based rule learning systems

Draft for: *"Here's what everyone gets wrong about evaluating AI agent output"*

In [None]:
# Exercise 3 workspace
draft = """
# Here's What Everyone Gets Wrong About Evaluating AI Agent Output

TODO: Write your draft here.
"""
print(draft)

# ═══════════════════════════════════════════════════════════════════════════════
# OUTRO
# ═══════════════════════════════════════════════════════════════════════════════

## What Just Happened

You started with one broken evaluation — an agent punished for following a rule —
and built a scorer that actually works. Here's the path you took:

1. **Felt the problem** on a single example (`contains` returned 0.0 on perfect compliance)
2. **Built vocabulary matching** to catch what `contains` missed
3. **Expanded to a full test suite** to find edge cases
4. **Added structural checks** to catch false positives (Entry 3)
5. **Added outcome signal** to catch broken implementations (Entries 7-8)
6. **Packaged it** as a `SalienceScorer` with configurable weights
7. **Validated** against your intuitive ratings using Spearman correlation

The formula — `S = w_l * linguistic + w_s * structural + w_o * outcome` — is a
weighted linear combination. It's the simplest model that could work.

You also established three constitutional rules for the entire arc:

1. Every scoring function must have a falsifiable claim
2. Weights must be updatable from data, not hardcoded forever
3. If you can't explain what a score of 0.7 means in plain English, the scorer isn't ready

## Publication Note

Exercise 3 is a draft for *"Here's what everyone gets wrong about evaluating
AI agent output."* Run an edit pass and it's ready to publish.

## What's Next

The weights are hand-picked. Why 0.4/0.4/0.2? Because it felt right. That's
not science. In Module 0.2, we build the probability foundations to make this
rigorous — so the weights can be learned from data instead of vibes.

--> [Module 0.2: Probability & Counting](../module-0.2-probability-counting/0.2-probability-counting-core.ipynb)

# ═══════════════════════════════════════════════════════════════════════════════
# RESOURCES
# ═══════════════════════════════════════════════════════════════════════════════

- **Salience (cognitive science)**: Salience as attention-weighted relevance comes
  from cognitive/perceptual psychology. Our scorer approximates what a human
  reviewer does intuitively.
- **Weighted linear combinations**: Any introductory linear algebra text covers
  why this is the simplest useful model. We'll formalize this in Arc 1.
- **From the archive**: `archive/v1-week-based/notebooks/` — earlier explorations
  of the bandit that this module's scorer feeds into.

### Companion Texts

- **Downey, *Think Stats* (2nd ed.)**: Ch 1-2 — exploratory data analysis foundations
- **Downey, *Think Bayes* (2nd ed.)**: Ch 1 — computational approach to Bayesian thinking
- **Blitzstein & Hwang, *Introduction to Probability***: Ch 1-2 — formal probability
  framework we'll use starting in Module 0.2