# LLM as a Judge

So far, we have been manually reviewing the LLM editor's outputs. This has been a relatively smooth process, but it is not scalable, as there are many failure cases we would need to keep track of. Investing in building an LLM judge makes sense at this stage. 

Before deploying an LLM judge, we need to ensure its performance is aligned with that of a human judge. This is critical as we would otherwise risk optimizing the wrong metric.
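One way to quantify that alignment is sketched below. It reports plain agreement (the share of notes where judge and human give the same verdict) together with Cohen's kappa, which discounts agreement expected by chance. Kappa is an addition here, not something the notebook computes, and `agreement_report` and the sample labels are hypothetical.

```python
from collections import Counter


def agreement_report(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels for the same notes."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human)
    counts = Counter(zip(human, judge))  # (human, judge) -> number of notes
    observed = (counts[(True, True)] + counts[(False, False)]) / n

    # Agreement expected by chance, from each rater's share of True labels.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    # Cohen's kappa; treated as 1.0 when chance agreement is already total.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"accuracy": observed, "kappa": kappa}


# Made-up labels, for illustration only.
human_labels = [False, True, True, True, True, False]
judge_labels = [False, True, True, False, True, True]
report = agreement_report(human_labels, judge_labels)
print(report)
```

A judge that always answers `True` can still score high plain agreement on a dataset dominated by good notes, which is why a chance-corrected number is worth tracking alongside it.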

Let's get started by creating a small human-annotated dataset of reviews. This dataset will later be used to evaluate the performance of our LLM judge. 

### Create an Eval dataset

To ease the process of creating an eval dataset, we built a small utility class, `ReviewApp`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import re
from json.decoder import JSONDecodeError
from typing import cast

import pandas as pd
from pydantic import BaseModel

from anki_ai.domain.model import Deck, Note
from anki_ai.entrypoints.review_notes_changes import ReviewApp
from anki_ai.service_layer.services import (
    ChatCompletionsService,
    get_chat_completion,
)

We have collected annotations for over 200 notes. We will use this dataset to evaluate the model's alignment with our preference.

In [3]:
orig_deck = Deck("original")
orig_deck.read_txt("../data/Selected Notes v7.txt")

deck = Deck("edited")
deck.read_txt("../data/new_deck.txt")

In [4]:
ra = ReviewApp(deck=deck)
ra.load("../data/eval.txt")

In [5]:
df_eval = pd.read_csv("../data/eval.txt", sep="\t", header=None)
df_eval.columns = ["guid", "score"]
df_eval.head()

Unnamed: 0,guid,score
0,A$U26>n14?,False
1,c#*tMdp`:C,True
2,hVkGAdktL6,True
3,yyo348j{|9,True
4,N1O$1BYpt$,True
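
The labels can also be loaded into a plain `guid -> bool` mapping, which avoids calling `eval` on the score strings later on. This is a minimal sketch that assumes the layout implied by the `read_csv` call above (tab-separated, no header, `guid` then `score`); `load_labels` is a hypothetical helper and the sample rows stand in for the real file.

```python
import csv
import io


def load_labels(lines) -> dict[str, bool]:
    """Parse tab-separated `guid<TAB>score` rows into a guid -> bool mapping."""
    labels: dict[str, bool] = {}
    reader = csv.reader(lines, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        guid, score = row
        if score not in ("True", "False"):
            raise ValueError(f"row {row_number}: unexpected score {score!r}")
        labels[guid] = score == "True"
    return labels


# In-memory stand-in for the eval file, for illustration only.
sample = io.StringIO("A$U26>n14?\tFalse\nhVkGAdktL6\tTrue\n")
labels = load_labels(sample)
print(labels)
```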


### Create a very simple LLM judge

Let's create a simple LLM judge, and evaluate its alignment with human preference by measuring how well it does on the eval dataset. 

In [6]:
SYSTEM_MSG = r"""
Your job is to evaluate Anki's notes and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes are written in hybrid markdown; for instance: the newline character is `<br>`, `<` is `&lt;`, etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for short commands: e.g., `iw`, `d`, etc.

Provide only a boolean score: False for bad and True for good.
"""


import ast


def review_note(note: Note, chat: ChatCompletionsService) -> bool:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
    )
    result: str = cast(str, chat_response.choices[0].message.content)

    print(user_msg)
    print(f"Eval: {result}")
    # Parse the reply as a literal instead of executing it with eval();
    # anything other than a bare literal raises SyntaxError or ValueError.
    return ast.literal_eval(result.strip())

In [7]:
chat = get_chat_completion()

aligned = 0
tot = 0
try:
    for guid, score in ra._ReviewApp__reviews.items():
        note = deck.get(guid=guid)[0]
        pred = review_note(note, chat)
        print(f"Ground Truth: {score}\n")
        # Compare as strings so a bool prediction matches a "True"/"False" label
        if str(pred) == str(score):
            aligned += 1
        tot += 1
        print("#######################\n")

    print(f"Alignment: {aligned / tot:.2%}")
except (SyntaxError, ValueError) as e:
    print(
        f"\nThe LLM did not comply with the prompt and returned something different from True or False: {e}"
    )

Front: Locker
Back: Locker
Tags: ['english']
Eval: False
Ground Truth: False


#######################

Front: Character-level vs word-level tokenization
Back: Character-level tokenizers have much smaller vocabularies
Tags: ['nlp']
Eval: False

Reason: The note is missing a newline character after the front and back fields. It should be formatted as:

Front: Character-level vs word-level tokenization<br>
Back: Character-level tokenizers have much smaller vocabularies<br>
Tags: ['nlp']

The LLM did not comply with the prompt and returned something different from True or False: invalid syntax (<string>, line 3)


This first model doesn't do well on the task. Let's try to improve it by using structured output and few-shot learning. 
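
The run above stopped because the reply contained more than a bare boolean. Before moving to structured output, a stricter parser for free-text replies could look like the sketch below; `parse_verdict` is a hypothetical helper, not part of `anki_ai`, and it treats anything other than a lone `True`/`False` as a non-answer.

```python
from typing import Optional


def parse_verdict(reply: str) -> Optional[bool]:
    """Return the boolean verdict, or None when the reply is not a bare boolean."""
    cleaned = reply.strip().strip(".").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None  # the model added an explanation or answered something else


print(parse_verdict("False"))
print(parse_verdict(" True\n"))
print(parse_verdict("False\n\nReason: The note is missing a newline character."))
```

Returning `None` lets the evaluation loop count non-compliant replies separately instead of aborting on the first one.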

### Improve performance of LLM judge

#### Structured output

Free-text replies are hard to handle: as the run above shows, the model does not always follow the instructions and can return something other than a bare boolean, which breaks parsing. This can happen quite frequently. To address that, let's constrain the reply with structured output.
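
To make the benefit concrete, the sketch below validates a JSON reply and returns `None` on any malformed output instead of raising. It uses only the standard library as a stand-in for the pydantic model used in the notebook; `ParsedReview` and `parse_review` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedReview:
    guid: str
    is_correct: bool


def parse_review(content: str, guid: str) -> Optional[ParsedReview]:
    """Validate a JSON reply; return None instead of raising on bad output."""
    try:
        payload = json.loads(content)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    verdict = payload.get("is_correct")
    # Require a real JSON boolean, not a string such as "True".
    if not isinstance(verdict, bool):
        return None
    # The guid comes from the note itself, not from the model's reply.
    return ParsedReview(guid=guid, is_correct=verdict)


print(parse_review('{"guid": "made-up", "is_correct": true}', guid="abc"))
print(parse_review("True", guid="abc"))
```

Overwriting the `guid` with the note's own value mirrors what the notebook does, since the model has no reliable way to know it.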

In [8]:
class Review(BaseModel):
    guid: str
    is_correct: bool

def review_note(note: Note, chat: ChatCompletionsService, verbose: bool = False) -> Review | None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")

        return result
    except JSONDecodeError as e:
        print(e)

In [9]:
chat = get_chat_completion()

aligned = 0
tot = 0
for guid, score in ra._ReviewApp__reviews.items():
    note = deck.get(guid=guid)[0]
    pred = review_note(note, chat, verbose=True)
    print(f"Ground Truth: {score}\n")
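    # Skip unparsable replies (None) and compare without calling eval() on the label
    if pred is not None and pred.is_correct == (str(score) == "True"):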
        aligned += 1
    tot += 1
    print("#######################\n")

print(f"Alignment: {aligned}/{tot} ({aligned / tot:.2%})")

Front: Locker
Back: Locker
Tags: ['english']
Eval: guid='A$U26>n14?' is_correct=False

Ground Truth: False


#######################

Front: Character-level vs word-level tokenization
Back: Character-level tokenizers have much smaller vocabularies
Tags: ['nlp']
Eval: guid='"c#*tMdp`:C"' is_correct=True

Ground Truth: True


#######################

Front: Chipset PCIe lanes name
Back: PCH lanes
Tags: ['gpu', 'hardware']
Eval: guid='hVkGAdktL6' is_correct=True

Ground Truth: True


#######################

Front: WebSockets vs traditional web communication
Back: HTTP follows a request-response model. WebSockets introduce a full-duplex communication channel
Tags: ['system-design']
Eval: guid='yyo348j{|9' is_correct=True

Ground Truth: True


#######################

Front: Test Time Augmentation
Back: At inference/validation time, create multiple versions of each image using data augmentation, then take the average/max of predictions for each version.
Tags: ['fastai']
Eval: guid='N1O$1BY

#### Few-shot prompting

Some of the answers are incorrect. Let's try to pass a few examples to the LLM judge to see if we can improve on that.
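
Rather than hand-writing the examples inside the prompt, the few-shot section can be rendered from a list, which makes it easier to add or swap examples. This is a minimal sketch: `Example` and `render_examples` are hypothetical helpers, and the two sample notes are taken from the prompt used in this section.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # triple backtick, built this way to keep the snippet easy to embed


@dataclass
class Example:
    front: str
    back: str
    tags: list[str]
    reasoning: str = ""  # only filled in for bad examples


def render_examples(title: str, examples: list[Example]) -> str:
    """Render one titled block of numbered examples."""
    lines = [f"{title}:", ""]
    for number, ex in enumerate(examples, start=1):
        lines += [
            f"Example {number}:",
            "",
            f"    Front: {ex.front}",
            f"    Back:  {ex.back}",
            f"    Tags:  {ex.tags}",
            "",
        ]
        if ex.reasoning:
            lines += [f"    Reasoning: {ex.reasoning}", ""]
    return "\n".join(lines)


good = [Example("Create soft link", f"{FENCE}bash<br>$ ln -s <file> <link><br>{FENCE}", ["linux"])]
bad = [
    Example(
        "Return to previous directory",
        f"{FENCE}bash $ cd -{FENCE}",
        ["linux"],
        "Missing newlines (<br> tags) in code block",
    )
]
few_shot = "\n".join(
    [render_examples("Examples of good notes", good), render_examples("Examples of bad notes", bad)]
)
print(few_shot)
```

The rendered text can then be appended to the requirements part of the system message.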

In [10]:
SYSTEM_MSG = r"""
Your job is to evaluate Anki notes, and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes should be in HTML format; for instance: newline should be "<br>", "<" should be "&lt;", etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for very short commands: `iw`, `d`, etc.

Examples of good notes:

Example 1:

    Front: Create soft link
    Back:  ```bash<br>$ ln -s <file> <link><br>```
    Tags:  ['linux']

Example 2:

    Front: Zip destination option
    Back:  ```bash<br>$ unzip <file> -d <path><br>```
    Tags:  ['linux']

Example 3:

    Front: Extract zip files
    Back:  ```bash<br>$ unzip <file><br>```
    Tags:  ['linux']

Example 4:

    Front: List directory content
    Back:  ```bash<br>$ ls <path><br>```
    Tags:  ['linux']

Examples of bad notes: 

Example 1:

    Front: Return to previous directory
    Back:  ```bash $ cd -```
    Tags:  ['linux']

    Reasoning: Missing newlines (<br> tags) in code block

Example 2: 

    Front: Remove delimiters
    Back:  ```ds <delimiter>```
    Tags:  ['nvim']

    Reasoning: Using triple backtick quotes without specifying the language and adding newlines (<br> tag) in code block

Example 3: 

    Front: Change Anki delimiters
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Mentioning the command is an Anki command when, in fact, it's a nvim command

Example 4: 

    Front: Text object for a sentence
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Missing command and not closing code block
"""


def review_note(note: Note, chat: ChatCompletionsService, verbose: bool = False) -> Review | None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")

        return result
    except JSONDecodeError as e:
        print(e)

In [11]:
chat = get_chat_completion()

aligned = 0
tot = 0
for guid, score in ra._ReviewApp__reviews.items():
    note = deck.get(guid=guid)[0]
    pred = review_note(note, chat, verbose=True)
    print(f"Ground Truth: {score}\n")
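    # Skip unparsable replies (None) and compare without calling eval() on the label
    if pred is not None and pred.is_correct == (str(score) == "True"):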
        aligned += 1
    tot += 1
    print("#######################\n")

print(f"Alignment: {aligned}/{tot} ({aligned / tot:.2%})")

Front: Locker
Back: Locker
Tags: ['english']
Eval: guid='A$U26>n14?' is_correct=False

Ground Truth: False


#######################

Front: Character-level vs word-level tokenization
Back: Character-level tokenizers have much smaller vocabularies
Tags: ['nlp']
Eval: guid='"c#*tMdp`:C"' is_correct=True

Ground Truth: True


#######################

Front: Chipset PCIe lanes name
Back: PCH lanes
Tags: ['gpu', 'hardware']
Eval: guid='hVkGAdktL6' is_correct=True

Ground Truth: True


#######################

Front: WebSockets vs traditional web communication
Back: HTTP follows a request-response model. WebSockets introduce a full-duplex communication channel
Tags: ['system-design']
Eval: guid='yyo348j{|9' is_correct=True

Ground Truth: True


#######################

Front: Test Time Augmentation
Back: At inference/validation time, create multiple versions of each image using data augmentation, then take the average/max of predictions for each version.
Tags: ['fastai']
Eval: guid='N1O$1BY

This result is surprising: we expected the few-shot examples to help the model understand the expected formatting of these notes.

A few things we want to try next:
1. For each type of common error (e.g., double backslash in LaTeX code, code block used for math, etc.), provide both a negative and a positive example
2. Ask the LLM to provide reasoning

In [12]:
SYSTEM_MSG = r"""
Your job is to evaluate the formatting of an Anki note.

Properly formatted notes should:
* Use hybrid markdown format. For instance, use "<br>" to signal a new line, "&lt;" for "<" symbol, etc.
* Preserve images and media on the original note
* Wrap code in a code block: ```<language><br><command><br>```
* Wrap math in a LaTeX block: $ <math equation> $. Also, ensure that we do not use double backslashes, \\, in a LaTeX block, as that won't be correctly displayed
* Wrap short commands in an inline code block: `iw`, `d`, etc.

Provide concise reasoning for your answer and a True/False answer, where True means the note is formatted correctly and False means the note is not properly formatted.
"""

class Review(BaseModel):
    guid: str
    reasoning: str 
    is_correct: bool

def review_note(note: Note, chat: ChatCompletionsService, verbose: bool = False) -> Review | None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")

        return result
    except JSONDecodeError as e:
        print(e)

chat = get_chat_completion()
aligned = 0
tot = 0
for guid, score in ra._ReviewApp__reviews.items():
    note = deck.get(guid=guid)[0]
    pred = review_note(note, chat, verbose=True)
    print(f"Ground Truth: {score}\n")
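    # Skip unparsable replies (None) and compare without calling eval() on the label
    if pred is not None and pred.is_correct == (str(score) == "True"):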
        aligned += 1
    tot += 1
    print("#######################\n")

print(f"Alignment: {aligned}/{tot} ({aligned / tot:.2%})")

Front: Locker
Back: Locker
Tags: ['english']
Eval: guid='A$U26>n14?' reasoning="The front and back of the note are identical, which is not a good practice in Anki. It's better to have a clear and concise front and a detailed back." is_correct=False

Ground Truth: False


#######################

Front: Character-level vs word-level tokenization
Back: Character-level tokenizers have much smaller vocabularies
Tags: ['nlp']
Eval: guid='"c#*tMdp`:C"' reasoning='The note is properly formatted.' is_correct=True

Ground Truth: True


#######################

Front: Chipset PCIe lanes name
Back: PCH lanes
Tags: ['gpu', 'hardware']
Eval: guid='hVkGAdktL6' reasoning='The note is properly formatted as it uses the required fields and does not contain any formatting issues.' is_correct=True

Ground Truth: True


#######################

Front: WebSockets vs traditional web communication
Back: HTTP follows a request-response model. WebSockets introduce a full-duplex communication channel
Tags: ['sys

Some of the mistakes the LLM judge makes are due to:
* Missing original note, which makes it hard to know if an `<img>` was removed
* Reviewing the format of the tags, which is puzzling
* Being lenient about unclosed code blocks, which are technically valid markdown, so maybe we could let it slide

In [13]:
SYSTEM_MSG = r"""
The user will share two Anki notes: the original and improved versions. The improved version should be factually the same as the original note but more concise and might have a slightly different format.

Your job is to evaluate the formatting of the improved version. Properly formatted notes should:
* Use hybrid markdown format. For instance, when relevant, use `<br>` to signal a new line, `&nbsp;` to signal a non-breaking space, `&lt;` for `<` symbol, etc. Cards with just one sentence that do not include code or math equations do not require special formatting
* Preserve images and media present on the original note
* Code should be wrapped in a code block: ```<language><br><command><br>```. One line command should use an inline code block: `iw`, `d`, `:copen`, etc.
* Mathematical equations should be wrapped in a LaTeX block: $ <math equation> $. Also, ensure that we do not use double backslashes, \\, in a LaTeX block, as that won't be correctly displayed

Provide concise reasoning for your answer and a True/False answer, where True means the improved note is formatted correctly and False means the improved note is not properly formatted.
"""

class Review(BaseModel):
    guid: str
    reasoning: str 
    is_correct: bool

def review_note(note: Note, chat: ChatCompletionsService, verbose: bool = False) -> Review | None:
    orig = orig_deck.get(note.guid)[0]
    user_msg = f"""Original note:\nFront: {orig.front}\nBack: {orig.back}\nTags: {orig.tags}\nNew note:\nFront: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")

        return result
    except JSONDecodeError as e:
        print(e)

chat = get_chat_completion()
aligned = 0
tot = 0
for guid, score in ra._ReviewApp__reviews.items():
    note = deck.get(guid=guid)[0]
    pred = review_note(note, chat, verbose=True)
    print(f"Ground Truth: {score}\n")
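    # Skip unparsable replies (None) and compare without calling eval() on the label
    if pred is not None and pred.is_correct == (str(score) == "True"):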
        aligned += 1
    tot += 1
    print("#######################\n")

print(f"Alignment: {aligned}/{tot} ({aligned / tot:.2%})")

Original note:
Front: "<img src=""paste-bd59a972734fb91c3325b1dec38ea93d47925ab7.jpg"">"
Back: Locker
Tags: ['english']
New note:
Front: Locker
Back: Locker
Tags: ['english']
Eval: guid='A$U26>n14?' reasoning='The improved note is missing the image from the original note. It should preserve the image present on the original note.' is_correct=False

Ground Truth: False


#######################

Original note:
Front: When talking about vocabulary size, what is the main difference between character-level and word-level tokenization?
Back: "Character-level tokenizers have much smaller vocabularies<br><br><div><img src=""paste-ab8f01c4dc695a6f0d8fa93effdca0c77b94398d.jpg""><br></div>"
Tags: ['nlp']
New note:
Front: Character-level vs word-level tokenization
Back: Character-level tokenizers have much smaller vocabularies
Tags: ['nlp']
Eval: guid='"c#*tMdp`:C"' reasoning='The improved note is missing the image and the line breaks. The original note had a clear separation between the two line

### Create helper functions to facilitate reviewing notes

Let's create a `pandas.DataFrame` that combines the original note, the edited note, and the LLM review. This will make it easier to inspect the judge's reviews.

In [14]:
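# `results` was never collected in the loops above, so gather the judge's reviews here.
results = [review_note(deck.get(guid=guid)[0], chat) for guid in ra._ReviewApp__reviews]
# Drop unparsable replies (None); model_dump() is the pydantic v2 replacement for dict().
dict_data = [item.model_dump() for item in results if item is not None]
df_scores = pd.DataFrame(dict_data)
df_scores.head()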

In [None]:
a = [note.dict() for note in deck]
df_notes = pd.DataFrame(a)
df_notes.head()

In [None]:
x = pd.merge(df_notes, df_scores, how="inner", on="guid")
x = x[x.tags.apply(lambda a: "life" not in a)]  # exclude personal notes
print(x.shape)
x.head(25)

In [None]:
def validate_interactive_session(session_text):
    lines = session_text.strip().split("<br>")
    input_pattern = r"^>>> .*$"
    # Escape the dots so they match a literal "..." and not any three characters
    continuation_pattern = r"^\.\.\. .*$"
    output_pattern = r"^(?!>>>)(?!\.\.\.)"

    state = "expecting_input"
    for i, line in enumerate(lines, 1):
        if state == "expecting_input":
            if not (
                re.match(input_pattern, line) or re.match(continuation_pattern, line)
            ):
                return False, f"Line {i}: Expected input (>>> or ...), got: {line}"
            state = "optional_output"
        elif state == "optional_output":
            if re.match(input_pattern, line) or re.match(continuation_pattern, line):
                # Another prompt line; output may follow it, so the state is unchanged
                continue
            elif not re.match(output_pattern, line):
                return False, f"Line {i}: Invalid output format: {line}"

    return True, "Valid interactive session format"


def validate_code_block_format(block):
    # Check if the block starts and ends with ```
    if not (block.startswith("```") and block.endswith("```")):
        return False, "Code block should start and end with ```"

    # Remove the opening and closing ```
    content = block[3:-3].strip()

    # Check if the block starts with a language specifier
    if not re.match(r"^[\w-]+<br>", content):
        return (
            False,
            "Code block should start with a language specifier followed by <br>",
        )

    # Split the content by <br> tags
    lines = content.split("<br>")

    # Check if the last line is empty (as it should end with <br>)
    if lines[-1].strip() != "":
        return False, "Code block should end with <br>"

    # Check if there are any empty lines in between (which would indicate missing <br>)
    if any(line.strip() == "" for line in lines[1:-1]):
        return (
            False,
            "Code block should not have empty lines. Use <br> for line breaks.",
        )

    return True, "Valid code block format"


def validate_hybrid_markdown(content):
    issues = []

    # Check for double backslashes in LaTeX blocks
    latex_blocks = re.findall(r"\$(.*?)\$", content, re.DOTALL)
    for block in latex_blocks:
        if "\\\\" in block:
            issues.append(
                "Double backslash (\\\\) found in LaTeX block. This may cause rendering issues."
            )

    # Check for unmatched dollar signs
    # Split the content into code blocks and non-code blocks
    parts = re.split(r"(```[\s\S]*?```)", content)

    total_dollar_count = 0
    for part in parts:
        if part.startswith("```") and part.endswith("```"):
            # This is a code block
            is_valid, message = validate_code_block_format(part)
            if not is_valid:
                issues.append(f"Invalid code block format: {message}")

            if part.startswith("```python"):
                # Check if it's an interactive Python session
                session_content = part[13:-3].strip()  # Remove ```python<br> and ```
                is_valid, message = validate_interactive_session(session_content)
                if not is_valid:
                    issues.append(
                        f"Invalid Python interactive session in code block: {message}"
                    )
        else:
            # Count dollar signs in non-code block parts
            dollar_count = part.count("$")
            total_dollar_count += dollar_count

    # Check if the total number of dollar signs outside code blocks is odd
    if total_dollar_count % 2 != 0:
        issues.append(
            "Unmatched dollar signs outside code blocks. LaTeX may not render correctly."
        )

    # Check for common Markdown syntax errors
    if "```" in content and content.count("```") % 2 != 0:
        issues.append(
            "Unmatched code block delimiters (```). Code blocks may not render correctly."
        )

    return issues

In [None]:
n_reviews = 100

for _, note in x.iloc[:n_reviews].iterrows():
    print(f"Front: {note['front']}\nBack: {note['back']}\nTags: {note['tags']}")
    for side in ["front", "back"]:
        issues = validate_hybrid_markdown(note[side])
        if issues:
            for issue in issues:
                print(f"Issue {side}: {issue}")
        else:
            print(f"Issue {side}: None")
    print("\n")

Common errors are:

* Missing `<img>`
* Wrong prompt (e.g., `>>`, missing `$`)
* Missing `<br>` inside code block
* Missing `<br>` outside code block
* `\\` in LaTeX
* References (should we remove them?)
* Trailing `.` (full stop)
* Using code block for note that does not contain code
* "```bash" for keymap
* Missing language in code block
* Unmatched code block delimiter (missing trailing "```")
* Missing inline code block for keymap or short commands
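
Some of these errors do not need an LLM at all. As an illustration, the first item (missing `<img>`) can be checked by comparing image sources between the original and the edited note. This is a minimal sketch: `missing_images` is a hypothetical helper, the file names are made up, and the pattern tolerates the doubled quotes seen in the exported notes.

```python
import re

# Capture the src value of every <img> tag; accepts "..." as well as ""..."".
IMG_SRC = re.compile(r'<img[^>]*\bsrc\s*=\s*"+([^"]+)"+', re.IGNORECASE)


def missing_images(original: str, edited: str) -> set[str]:
    """Return image sources present in the original note but absent from the edit."""
    return set(IMG_SRC.findall(original)) - set(IMG_SRC.findall(edited))


# Made-up note contents, for illustration only.
original_back = 'Some explanation<br><br><div><img src=""example.jpg""><br></div>'
edited_back = "Some explanation"
print(missing_images(original_back, edited_back))
print(missing_images('<img src="kept.png">', 'Shorter text <img src="kept.png">'))
```

Deterministic checks like this one could run before the LLM judge, leaving the judge to handle only the cases that need interpretation.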

### Todo

- [ ] Create a dataset to measure LLM judge's alignment with human preference 
- [ ] Use _reflection_ agentic workflow to improve notes