# Exercise 1

Coding: Build a simple Pairwise Judgment Evaluator. Use an LLM API (e.g., GPT-5) to compare two responses. Implement a swap test: for a given pair (response A vs B), have the judge pick a winner; then present the responses in reversed order (B vs A) and ask again. If the judge is inconsistent (it prefers A in one order and B in the other), record this as a potential bias instance and default to “tie” or require human review. Test this on a few known cases (you can fabricate scenarios, e.g., A is longer but correct, B is shorter but slightly less accurate). Measure the agreement rate of the judge with itself under swapping.

## Solution

import os

try:
    from openai import OpenAI
except ModuleNotFoundError:
    raise ModuleNotFoundError("Please install the 'openai' Python package.")


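# Read the key from the environment instead of hardcoding it in the notebook.
# OpenAI() picks up OPENAI_API_KEY from the environment on its own.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set.")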

client = OpenAI()
MODEL = "gpt-4.1-nano"
SYSTEM = (
    "You are a strict, fair judge of response quality. Prioritize correctness, "
    "then clarity and concision."
)


def judge_pair(prompt, a, b):
    user = f"""Prompt:
{prompt}

Response A:
{a}

Response B:
{b}

Return only one token: A, B, or TIE."""

    try:
        resp = client.responses.create(
            model=MODEL,
            input=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user},
            ],
            temperature=0,
            top_p=1,
            max_output_tokens=200,
        )
    except Exception as e:
        raise RuntimeError(
            "OpenAI API call failed. Common causes: invalid API key, no network, or wrong model name.\n"
            f"Underlying error: {type(e).__name__}: {e}"
        ) from e

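    # Accept only an exact verdict (ignoring surrounding whitespace, quotes, or a
    # trailing period) so replies like "ANSWER: B" are rejected, not read as "A".
    text = (resp.output_text or "").upper().strip(" \t\n.\"'")
    if text in ("A", "B", "TIE"):
        return text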

    raise ValueError(
        "Judge returned an unexpected output (expected exactly: A, B, or TIE).\n"
        f"Raw judge output: {resp.output_text!r}"
    )


def swap_test(prompt, a, b):
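    r1 = judge_pair(prompt, a, b)  # a is shown as "Response A"
    r2 = judge_pair(prompt, b, a)  # swapped: a is now shown as "Response B"
    # Verdicts are position labels, so a consistent judge flips its label after
    # the swap (A then B, or B then A) or answers TIE both times.
    consistent = (r1, r2) in {("A", "B"), ("B", "A"), ("TIE", "TIE")}
    # r1 uses the original labels, so "final" names the original response.
    final = r1 if consistent else "TIE"
    return {"first": r1, "second": r2, "consistent": consistent, "final": final}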


cases = [
    {
        "name": "Induction vs proof",
        "prompt": "If all swans you have seen are white, does that prove all swans are white? Answer in one sentence.",
        "A": "No. That is inductive evidence, not a proof.",
        "B": "Yes. Observing enough swans proves it.",
    },
    {
        "name": "0.999...",
        "prompt": "Is 0.999... equal to 1? Briefly justify.",
        "A": "Yes. It is the limit of 0.9, 0.99, 0.999, ... which equals 1.",
        "B": "No. It is slightly less than 1.",
    },
    {
        "name": "Derivative of |x| at 0",
        "prompt": "Does f(x)=|x| have a derivative at x=0? Answer concisely.",
        "A": "No. The left and right derivatives are -1 and 1.",
        "B": "Yes. The derivative at 0 is 0.",
    },
    {
        "name": "Verbose vs concise (both correct)",
        "prompt": "In one sentence, describe what a binary search does.",
        "A": "Binary search works on sorted data by repeatedly halving the search interval and checking the middle element to find a target, which is efficient for large lists.",
        "B": "It finds a target in a sorted list by repeatedly halving the search range.",
    },
]


results = []
for c in cases:
    r = swap_test(c["prompt"], c["A"], c["B"])
    r["name"] = c["name"]
    results.append(r)
    flag = " (bias -> tie)" if not r["consistent"] else ""
    print(f"{c['name']}: {r['first']} then {r['second']} -> {r['final']}{flag}")

agree = sum(r["consistent"] for r in results)
rate = agree / len(results)
print(f"Agreement rate: {rate:.0%} ({agree}/{len(results)})")


Induction vs proof: A then B -> A
0.999...: A then B -> A
Derivative of |x| at 0: A then B -> A
Verbose vs concise (both correct): B then B -> TIE (bias -> tie)
Agreement rate: 75% (3/4)
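
The swap logic can also be checked offline, without any API calls. The sketch below is not part of the solution above: `length_judge` and `second_slot_judge` are hypothetical stub judges invented for illustration, and `swap_consistent` simply repeats the consistency rule from `swap_test` with the judge passed in as a parameter.

```python
def swap_consistent(judge, prompt, a, b):
    """Run a judge in both orders; return (consistent, final verdict)."""
    r1 = judge(prompt, a, b)  # a shown in slot A
    r2 = judge(prompt, b, a)  # a shown in slot B
    consistent = (r1, r2) in {("A", "B"), ("B", "A"), ("TIE", "TIE")}
    return consistent, (r1 if consistent else "TIE")


def length_judge(prompt, a, b):
    """Hypothetical stub: prefers the shorter response, regardless of position."""
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) < len(b) else "B"


def second_slot_judge(prompt, a, b):
    """Hypothetical stub: always picks the second slot (pure position bias)."""
    return "B"


pair = (
    "What does binary search do?",
    "Halves a sorted range.",
    "It repeatedly halves a sorted search range.",
)
fair = swap_consistent(length_judge, *pair)
biased = swap_consistent(second_slot_judge, *pair)
print(fair)    # (True, 'A')
print(biased)  # (False, 'TIE')
```

A judge with a content-based preference flips its position label after the swap and passes; a judge that always picks the same slot fails and is downgraded to a tie.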


# Exercise 2

Analysis: You have a small set of prompts where you also collected human evaluations of outputs. Use this as a calibration set for your LLM-judge. For each prompt, you have the judge’s preferred answer and the human’s preferred answer. How would you estimate the judge’s error rates (false positive/negative) from this? Outline how to adjust the judge’s scores statistically so that, say, a 70% win rate reported by the judge comes with a confidence interval for the true human preference rate.

## Solution

Start with a human-labeled **gold set**, then split it into two parts:

- **Few-shot subset**: examples you include in the judge prompt.
- **Holdout subset**: examples the judge never sees.

On the holdout, compare judge vs human to estimate error rates (e.g., judge picks A when human prefers B):

- $FP = P(\text{judge picks A} \mid \text{human prefers B})$
- $FN = P(\text{judge picks B} \mid \text{human prefers A})$
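
As a minimal sketch of the two definitions above, the rates can be computed by counting over the holdout. The data here is made up for illustration, and the representation (a list of `(judge_pick, human_pick)` tuples with ties already filtered out) is an assumption, not something the exercise prescribes.

```python
def error_rates(pairs):
    """Estimate (FP, FN) from (judge_pick, human_pick) pairs, each "A" or "B"."""
    judge_when_human_b = [j for j, h in pairs if h == "B"]
    judge_when_human_a = [j for j, h in pairs if h == "A"]
    if not judge_when_human_a or not judge_when_human_b:
        raise ValueError("Holdout needs at least one human-A and one human-B case.")
    fp = judge_when_human_b.count("A") / len(judge_when_human_b)
    fn = judge_when_human_a.count("B") / len(judge_when_human_a)
    return fp, fn


# Made-up holdout: 10 cases where humans prefer A, 10 where they prefer B.
holdout = (
    [("A", "A")] * 8 + [("B", "A")] * 2    # judge misses A twice
    + [("A", "B")] * 1 + [("B", "B")] * 9  # judge wrongly picks A once
)
fp, fn = error_rates(holdout)
print(fp, fn)  # 0.1 0.2
```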

If the judge reports a win rate $p_j = P(\text{judge picks A})$ on unlabeled prompts, relate it to the true human preference rate $p_h = P(\text{human prefers A})$ via:

$$
 p_j = p_h(1 - FN) + (1 - p_h)FP
$$

Solve for $p_h$:

$$
 p_h = \frac{p_j - FP}{1 - FP - FN}
$$

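Then clip $p_h$ to $[0, 1]$ if needed. The correction is only meaningful when $FP + FN < 1$, i.e., when the judge is better than chance; otherwise the denominator is zero or negative.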
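
A small sketch of the correction as a function, assuming the error rates have already been estimated. The name `corrected_rate` and the example numbers (`FP = 0.10`, `FN = 0.20`) are illustrative only.

```python
def corrected_rate(p_j, fp, fn):
    """Invert p_j = p_h * (1 - FN) + (1 - p_h) * FP for p_h, clipped to [0, 1]."""
    denom = 1.0 - fp - fn
    if denom <= 0:
        raise ValueError("FP + FN >= 1: the judge is no better than chance.")
    return min(1.0, max(0.0, (p_j - fp) / denom))


# Judge reports a 70% win rate; illustrative error rates FP = 0.10, FN = 0.20.
p_h = corrected_rate(0.70, 0.10, 0.20)
print(round(p_h, 3))  # 0.857
```

With these numbers the judge understates the human preference rate, because it misses true A wins more often than it invents them.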

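For uncertainty, note that the corrected estimate depends on three noisy quantities: $p_j$, $FP$, and $FN$. Wilson/binomial confidence intervals are fast, closed-form, and good for small $n$ under i.i.d. Bernoulli assumptions, but each one covers a single proportion. Bootstrap confidence intervals resample the holdout set (and the judged prompts) and recompute $p_h$ each time, so they propagate all three sources of error; they are more flexible, but need more data/compute.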
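
One way to sketch the bootstrap option is a percentile interval that resamples both the holdout pairs and the judge's verdicts on unlabeled prompts. Everything here is an assumption for illustration: the function name `bootstrap_ci`, the data layout, the made-up counts, and the choice of a simple percentile bootstrap over other variants.

```python
import random


def bootstrap_ci(holdout, judge_picks, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the corrected human preference rate p_h.

    holdout: list of (judge_pick, human_pick) pairs, each "A" or "B".
    judge_picks: list of judge verdicts ("A"/"B") on unlabeled prompts.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        h = rng.choices(holdout, k=len(holdout))          # resample holdout
        u = rng.choices(judge_picks, k=len(judge_picks))  # resample verdicts
        when_a = [j for j, human in h if human == "A"]
        when_b = [j for j, human in h if human == "B"]
        if not when_a or not when_b:
            continue  # error rates undefined for this resample
        fp = when_b.count("A") / len(when_b)
        fn = when_a.count("B") / len(when_a)
        if fp + fn >= 1:
            continue  # correction undefined for this resample
        p_j = u.count("A") / len(u)
        estimates.append(min(1.0, max(0.0, (p_j - fp) / (1 - fp - fn))))
    if not estimates:
        raise ValueError("No usable bootstrap resamples.")
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[max(0, int((1 - alpha / 2) * len(estimates)) - 1)]
    return lo, hi


# Made-up data: 20 labeled holdout pairs, 200 unlabeled verdicts (70% A).
holdout = [("A", "A")] * 8 + [("B", "A")] * 2 + [("A", "B")] * 1 + [("B", "B")] * 9
judge_picks = ["A"] * 140 + ["B"] * 60
lo, hi = bootstrap_ci(holdout, judge_picks)
print(f"95% CI for p_h: [{lo:.2f}, {hi:.2f}]")
```

With a holdout this small, most of the interval's width comes from the uncertainty in $FP$ and $FN$ rather than from $p_j$, which is a practical argument for labeling more holdout examples.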


# Exercise 3

Discussion: Imagine our Sqwish system using an LLM judge to decide which of two prompts led to a better answer (when user feedback is not immediately available). What are the risks if we blindly trust the LLM judge’s verdicts? List at least two failure modes (e.g. the judge might prefer flowery language even if factual accuracy suffers). How can the team mitigate these? (Think in terms of periodic human audits, calibration as above, or mixing in some known test queries with correct answers.)

## Solution

**Failure modes**

- **Style/length bias**: the judge prefers verbosity or polished prose over factual accuracy.
- **Order and formatting bias**: the verdict changes based on which answer is shown first or on superficial formatting differences; a swap test reveals the order-dependent inconsistency.

**Mitigations**

- **Periodic audits**: regularly sample judgments for human review.
- **Fixed gold set**: maintain a stable set of benchmark queries and track judge performance over time to detect drift.
- **Cross-judge checks**: occasionally compare the production judge against a stronger (but more expensive) reasoning model as a proxy signal, and recalibrate when disagreement/error rates rise.
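
The fixed gold set idea can be sketched as a simple drift check. The function names, the dictionary layout, the query IDs, and the `tolerance` threshold are all hypothetical choices for illustration; a real deployment would pick its own schedule and alerting rule.

```python
def gold_accuracy(verdicts, gold):
    """Fraction of gold queries where the judge's verdict matches the known answer."""
    return sum(verdicts[q] == gold[q] for q in gold) / len(gold)


def needs_recalibration(history, baseline, tolerance=0.05):
    """Flag drift when the latest accuracy falls more than `tolerance` below baseline."""
    return history[-1] < baseline - tolerance


# Hypothetical gold set: query id -> known better response.
gold = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
week1 = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}  # judge agrees on all four
week2 = {"q1": "A", "q2": "A", "q3": "A", "q4": "B"}  # judge now gets q2 wrong

history = [gold_accuracy(week1, gold), gold_accuracy(week2, gold)]
print(history)  # [1.0, 0.75]
alert = needs_recalibration(history, baseline=history[0], tolerance=0.1)
print(alert)    # True
```

Mixing these gold queries into normal traffic, rather than running them as a separate batch, keeps the measured accuracy representative of how the judge behaves in production.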
