# LLM as a Judge

So far we have been manually reviewing the LLM editor's outputs. This has worked reasonably well, but it does not scale: there are many failure cases we would need to keep track of. Investing in an LLM judge makes sense at this stage.

Before deploying an LLM judge, we need to ensure its performance is aligned with that of a human judge. This is critical as we would otherwise risk optimizing the wrong metric.

Let's get started by creating a small human-annotated dataset of reviews. This dataset will later be used to evaluate the performance of our LLM judge. 

### Create an Eval dataset

In [1]:
import json
import random
import re
from json.decoder import JSONDecodeError
from pathlib import Path
from typing import cast

import pandas as pd
from pydantic import BaseModel
from tqdm import tqdm

from anki_ai.domain.model import Deck, Note
from anki_ai.service_layer.services import (
    ChatCompletionsService,
    get_chat_completion,
)

In [2]:
class Review:
    def __init__(self, human_or_llm: str, deck: Deck):
        self.human_or_llm = human_or_llm
        self.deck = deck


class ReviewApp:
    def __init__(self, deck: Deck):
        self.deck = deck
        self.__reviews: list = []
    
    def _get_boolean_input(self, prompt):
        while True:
            response = input(prompt).strip().lower()
            if response in ('y', 'yes', 'true', '1'):
                return True
            elif response in ('n', 'no', 'false', '0'):
                return False
            elif response == 's':
                return None
            elif response == 'q':
                return 'quit'
            else:
                print("Invalid input. Please enter Y (Yes), N (No), S (Skip), or Q (Quit).")

    def review(self):
        self.__reviews = []
        for i, note in enumerate(self.deck):
            print(f"\nCard {i+1} of {len(self.deck)}")
            print(f"Front: {note.front}\nBack: {note.back}\n")
            
            prompt = "Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit: "
            response = self._get_boolean_input(prompt)
            
            if response == 'quit':
                print("Exiting review. Progress saved.")
                break
            elif response is None:
                print("Skipping this card.")
                self.__reviews.append(None)
            else:
                self.__reviews.append(response)
            
            print("\n")
        
        self.save_progress()
    
    def save_progress(self):
        pass
    
    def resume_review(self):
        pass

    def save(self, fpath: Path):
        pass
        
    def load(self, fpath: Path):
        pass
    
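    def accuracy(self):
        if len(self.deck) != len(self.__reviews):
            print("Dataset not fully annotated. Can't compute accuracy yet.")
            return
        # Skipped cards are stored as None; leave them out of the score.
        answered = [r for r in self.__reviews if r is not None]
        if not answered:
            print("All cards were skipped. Can't compute accuracy.")
            return
        print(f"Accuracy: {sum(answered) / len(answered):.2%}")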
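
The `save_progress`, `save`, and `load` methods above are still stubs. As a minimal sketch of how the labels could be persisted, assuming the reviews are kept as a plain list of `True`/`False`/`None` values and that a JSON file is an acceptable format (the helper names `save_reviews` and `load_reviews` are hypothetical and not part of `anki_ai`):

```python
import json
from pathlib import Path
from typing import Optional


def save_reviews(reviews: list[Optional[bool]], fpath: Path) -> None:
    """Write labels to disk; None (a skipped card) is stored as JSON null."""
    fpath.write_text(json.dumps({"reviews": reviews}))


def load_reviews(fpath: Path) -> list[Optional[bool]]:
    """Read labels back, returning an empty list if nothing was saved yet."""
    if not fpath.exists():
        return []
    return json.loads(fpath.read_text())["reviews"]
```

With something like this in place, resuming a session could start at index `len(load_reviews(fpath))` instead of at the first card.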

In [3]:
deck = Deck("edited")
deck.read_txt("../data/new_deck.txt")
random.shuffle(deck._Deck__collection)  # TODO: create method to do this
ra = ReviewApp(deck[:5])

In [4]:
ra.review()


Card 1 of 5
Front: List tables in a schema (psql)
Back: (d)isplay (t)ables\n\n```psql\npostgres=# \dt <schema name>.*\n```



Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit:  Y





Card 2 of 5
Front: Decent initial loss estimate
Back: The expected loss resulting from normally distributed logits ```python<br>>> logits = torch.randn(4)<br>>> y_true = torch.tensor([0., 0., 1., 0.])<br>>> loss = F.cross_entropy(input=logits, target=y_true)<br>>> print(f''logits: {logits}\\nloss: {loss}''')<br>logits: tensor([0.1300, 0.4489, 0.2878, 0.3832])<br>loss: 1.4180395603179932<br>```



Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit:  Y





Card 3 of 5
Front: Move window to far left
Back: `<C+wH>`\n[:help window-moving](<link>)



Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit:  Y





Card 4 of 5
Front: Tensor dot product
Back: ```python<br>>> torch.dot(torch.tensor([2, 3]), torch.tensor([2, 1]))<br>tensor(7)<br>```



Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit:  N





Card 5 of 5
Front: Compilation error handling with `Result`
Back: The compiler will warn us that the program is not handling a possible error



Is it correct? (Y/N/S/Q) - Y: Yes, N: No, S: Skip, Q: Quit:  N






In [9]:
ra.accuracy()

Accuracy: 60.00%


Now that we have collected some human feedback, we can build an LLM judge and measure how well it mimics a human judge.
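
One way to quantify that alignment is to compare the judge's verdicts with the human labels on the same cards. The sketch below is an illustration rather than part of the notebook: `agreement` and `cohen_kappa` are hypothetical helpers, and the judge labels are made up purely to show the call. Raw agreement is easy to read, while Cohen's kappa corrects for the agreement expected by chance.

```python
from typing import Sequence


def agreement(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of cards on which the judge and the human agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for chance, for two raters and binary labels."""
    p_o = agreement(human, judge)
    p_h = sum(human) / len(human)  # share of True labels from the human
    p_j = sum(judge) / len(judge)  # share of True labels from the judge
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)  # agreement expected by chance
    if p_e == 1:
        return 1.0  # both raters always give the same single label
    return (p_o - p_e) / (1 - p_e)


human_labels = [True, True, True, False, False]  # the five labels collected above
judge_labels = [True, True, True, True, False]  # made-up judge output
print(f"agreement: {agreement(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohen_kappa(human_labels, judge_labels):.2f}")
```

Kappa is worth tracking alongside raw agreement because a judge that answers `True` for every card can still agree with the human often when most cards are good.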

### Create a very simple LLM judge

Let's create a simple LLM judge that can identify the common errors we saw during the initial error analysis. We are not too concerned with the error rate of the LLM editor at this stage, since we do not yet trust the LLM judge.

In [10]:
SYSTEM_MSG = r"""
Your job is to evaluate Anki's notes and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes are written in hybrid markdown; for instance: the newline character is `<br>`, `<` is `&lt;`, etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for short commands: e.g., `iw`, `d`, etc.

Provide both a boolean score (False for bad and True for good) and a concise reasoning for your assessment of the note.
"""

def review_note(note: Note, chat: ChatCompletionsService) -> None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
    )
    result: str = cast(str, chat_response.choices[0].message.content)

    print(user_msg)
    print(f"Eval: {result}\n")
    print("#######################\n")

chat = get_chat_completion()
for note in deck[:10]:
    review_note(note, chat)

Front: List tables in a schema (psql)
Back: (d)isplay (t)ables\n\n```psql\npostgres=# \dt <schema name>.*\n```
Tags: ['postgres']
Eval: Boolean score: True
Reasoning: The note is well-formatted, using the correct hybrid markdown syntax and code block for the SQL command. The tags are also correctly formatted.

#######################

Front: Decent initial loss estimate
Back: The expected loss resulting from normally distributed logits ```python<br>>> logits = torch.randn(4)<br>>> y_true = torch.tensor([0., 0., 1., 0.])<br>>> loss = F.cross_entropy(input=logits, target=y_true)<br>>> print(f''logits: {logits}\\nloss: {loss}''')<br>logits: tensor([0.1300, 0.4489, 0.2878, 0.3832])<br>loss: 1.4180395603179932<br>```
Tags: ['dl', 'karpathy']
Eval: **Assessment:** True

**Reasoning:** The note is well-formatted. It uses the correct hybrid markdown syntax, and the code block is properly formatted with the language specified as Python. The inline code is also correctly formatted. The note incl

#### Structured output

Performance seems decent, but we should use structured output to have something more manageable to work with. 
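
The idea is to ask the model for a JSON object and validate it before use, instead of parsing free text. As a minimal sketch of the validation step only, the snippet below uses a standard-library dataclass as a stand-in for the pydantic model (the names `JudgeVerdict` and `parse_verdict` are hypothetical). Constraining the model's output to the schema is left to the inference server and is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeVerdict:
    is_correct: bool
    reasoning: str


def parse_verdict(raw: str) -> Optional[JudgeVerdict]:
    """Return a verdict, or None if the reply is not the JSON we asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    is_correct = data.get("is_correct")
    reasoning = data.get("reasoning")
    # Reject replies with missing fields or wrong types rather than guessing.
    if not isinstance(is_correct, bool) or not isinstance(reasoning, str):
        return None
    return JudgeVerdict(is_correct=is_correct, reasoning=reasoning)


print(parse_verdict('{"is_correct": false, "reasoning": "Missing <br>"}'))
print(parse_verdict("Boolean score: True"))  # free text is rejected
```

Returning `None` on failure keeps the caller in control of what to do with an unparseable review, for example skipping the note or retrying.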

In [11]:
class Review(BaseModel):
    guid: str
    is_correct: bool
    reasoning: str


def review_note(note: Note, chat: ChatCompletionsService, verbose=False) -> "Review | None":
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")
            print("#######################\n")

        return result
    except JSONDecodeError as e:
        # Malformed JSON from the model: report it and signal failure with None.
        print(e)
        return None

In [12]:
chat = get_chat_completion()
for note in deck[:10]:
    review_note(note, chat, verbose=True)

Front: List tables in a schema (psql)
Back: (d)isplay (t)ables\n\n```psql\npostgres=# \dt <schema name>.*\n```
Tags: ['postgres']
Eval: guid='yFerb_pv<{' is_correct=True reasoning='The note is formatted correctly with a clear front and back, and the code block is properly enclosed.'

#######################

Front: Decent initial loss estimate
Back: The expected loss resulting from normally distributed logits ```python<br>>> logits = torch.randn(4)<br>>> y_true = torch.tensor([0., 0., 1., 0.])<br>>> loss = F.cross_entropy(input=logits, target=y_true)<br>>> print(f''logits: {logits}\\nloss: {loss}''')<br>logits: tensor([0.1300, 0.4489, 0.2878, 0.3832])<br>loss: 1.4180395603179932<br>```
Tags: ['dl', 'karpathy']
Eval: guid='zCSymgGwhf' is_correct=True reasoning='The note is well-formatted with proper markdown and code blocks. The code is written in Python and uses the PyTorch library, which is relevant to the topic. The inline code format is used correctly for short commands like `F.cros

### Improve performance of LLM judge

#### Few-shot prompting

Some of the answers are incorrect. Let's try to pass a few examples to the LLM judge to see if we can improve on that.
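
Few-shot prompts tend to be easier to maintain when the examples live in data rather than in one long string. The following is a minimal sketch, assuming a hypothetical `FewShotExample` container and `format_examples` helper (neither exists in `anki_ai`); the two example notes are copied from the prompt used in this section.

```python
from dataclasses import dataclass
from typing import Optional

FENCE = "`" * 3  # triple backtick, built this way to avoid nesting fences


@dataclass
class FewShotExample:
    front: str
    back: str
    tags: list[str]
    reasoning: Optional[str] = None  # only set for bad notes


def format_examples(title: str, examples: list[FewShotExample]) -> str:
    """Render one labelled group of examples as indented text."""
    lines = [f"{title}:", ""]
    for i, ex in enumerate(examples, 1):
        lines += [
            f"Example {i}:",
            "",
            f"    Front: {ex.front}",
            f"    Back:  {ex.back}",
            f"    Tags:  {ex.tags}",
        ]
        if ex.reasoning:
            lines += ["", f"    Reasoning: {ex.reasoning}"]
        lines.append("")
    return "\n".join(lines)


good = [
    FewShotExample(
        "Create soft link", f"{FENCE}bash<br>$ ln -s <file> <link><br>{FENCE}", ["linux"]
    )
]
bad = [
    FewShotExample(
        "Return to previous directory",
        f"{FENCE}bash $ cd -{FENCE}",
        ["linux"],
        reasoning="Missing newlines (<br> tags) in code block",
    )
]

examples_text = "\n".join(
    [
        format_examples("Examples of good notes", good),
        format_examples("Examples of bad notes", bad),
    ]
)
print(examples_text)
```

Keeping examples as records also makes it straightforward to swap in misjudged notes from the human-annotated set as new bad examples.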

In [13]:
SYSTEM_MSG = r"""
Your job is to evaluate Anki notes, and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes should be in HTML format; for instance: newline should be "<br>", "<" should be "&lt;", etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for very short commands: `iw`, `d`, etc.

Examples of good notes:

Example 1:

    Front: Create soft link
    Back:  ```bash<br>$ ln -s <file> <link><br>```
    Tags:  ['linux']

Example 2:

    Front: Zip destination option
    Back:  ```bash<br>$ unzip <file> -d <path><br>```
    Tags:  ['linux']

Example 3:

    Front: Extract zip files
    Back:  ```bash<br>$ unzip <file><br>```
    Tags:  ['linux']

Example 4:

    Front: List directory content
    Back:  ```bash<br>$ ls <path><br>```
    Tags:  ['linux']

Examples of bad notes: 

Example 1:

    Front: Return to previous directory
    Back:  ```bash $ cd -```
    Tags:  ['linux']

    Reasoning: Missing newlines (<br> tags) in code block

Example 2: 

    Front: Remove delimiters
    Back:  ```ds <delimiter>```
    Tags:  ['nvim']

    Reasoning: Using triple backtick quotes without specifying the language and adding newlines (<br> tag) in code block

Example 3: 

    Front: Change Anki delimiters
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Mentioning the command is an Anki command when, in fact, it's a nvim command

Example 4: 

    Front: Text object for a sentence
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Missing command and not closing code block
"""

def review_note(note: Note, chat: ChatCompletionsService, verbose=False) -> "Review | None":
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
        content_dict["guid"] = note.guid
        updated_content_str = json.dumps(content_dict)
        result = Review.model_validate_json(updated_content_str)

        if verbose:
            print(user_msg)
            print(f"Eval: {result}\n")
            print("#######################\n")

        return result
    except JSONDecodeError as e:
        # Malformed JSON from the model: report it and signal failure with None.
        print(e)
        return None

chat = get_chat_completion()
for note in deck[:10]:
    review_note(note, chat, verbose=True)

Front: List tables in a schema (psql)
Back: (d)isplay (t)ables\n\n```psql\npostgres=# \dt <schema name>.*\n```
Tags: ['postgres']
Eval: guid='yFerb_pv<{' is_correct=True reasoning=''

#######################

Front: Decent initial loss estimate
Back: The expected loss resulting from normally distributed logits ```python<br>>> logits = torch.randn(4)<br>>> y_true = torch.tensor([0., 0., 1., 0.])<br>>> loss = F.cross_entropy(input=logits, target=y_true)<br>>> print(f''logits: {logits}\\nloss: {loss}''')<br>logits: tensor([0.1300, 0.4489, 0.2878, 0.3832])<br>loss: 1.4180395603179932<br>```
Tags: ['dl', 'karpathy']
Eval: guid='zCSymgGwhf' is_correct=True reasoning=''

#######################

Front: Move window to far left
Back: `<C+wH>`\n[:help window-moving](<link>)
Tags: ['nvim']
Eval: guid='ews+x?%*V&' is_correct=True reasoning=''

#######################

Front: Tensor dot product
Back: ```python<br>>> torch.dot(torch.tensor([2, 3]), torch.tensor([2, 1]))<br>tensor(7)<br>```
Tags: ['p

In [14]:
chat = get_chat_completion()
correct_cnt = 0

n = 200
results = []

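random.shuffle(deck._Deck__collection)  # TODO: create method to do this
for note in tqdm(deck[:n]):
    result = review_note(note, chat)
    if result is None:  # the model's reply could not be parsed; skip this note
        continue
    results.append(result)

    if result.is_correct:
        correct_cnt += 1

# Divide by the number of parsed reviews so skipped notes do not count as incorrect.
print(f"{correct_cnt / max(len(results), 1):.2%} correct")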

100%|██████████████████████████████████████████████| 200/200 [02:57<00:00,  1.13it/s]

54.00% correct





### Create some helper functions to facilitate reviewing notes

Let's create a `pandas.DataFrame` that joins each note with its LLM review. This will make it easier to audit the judge's verdicts.

In [15]:
dict_data = [item.model_dump() for item in results]  # pydantic v2 replacement for .dict()
df_scores = pd.DataFrame(dict_data)
df_scores.head()

Unnamed: 0,guid,is_correct,reasoning
0,o6aT]GQw](,True,
1,N}iOTSdKs0,False,Missing newlines (<br> tags) in code block
2,s>XEwKc=B.,False,Missing newlines (<br> tags) in code block
3,"N@8p,&7IfE",True,
4,i&S=]gfn{M,True,


In [16]:
a = [note.dict() for note in deck]
df_notes = pd.DataFrame(a)
df_notes.head()

Unnamed: 0,guid,front,back,tags,notetype,deck_name
0,o6aT]GQw](,High-dimensional space issue,"Everything becomes close, even by chance, due ...",[ml],KaTeX and Markdown Basic (Color),Default
1,N}iOTSdKs0,RL fine-tuning output,A sequence of text (or the probability distrib...,[llm],KaTeX and Markdown Basic (Color),Default
2,s>XEwKc=B.,Darker roasted water temperature,On the lower end of the $195^{\circ}-205^{\\ci...,[espresso],KaTeX and Markdown Basic (Color),Default
3,"N@8p,&7IfE",Show virtualenv path,```bash<br>$ pyenv which python<br>```,[pyenv],KaTeX and Markdown Basic (Color),Default
4,i&S=]gfn{M,Return individual sample loss,"Setting `reduction=""none""` in BCELoss: ```pyth...",[pytorch],KaTeX and Markdown Basic (Color),Default


In [17]:
x = pd.merge(df_notes, df_scores, how="inner", on="guid")
x = x[x.tags.apply(lambda a: "life" not in a)]  # exclude personal notes
print(x.shape)
x.head(25)

(200, 8)


Unnamed: 0,guid,front,back,tags,notetype,deck_name,is_correct,reasoning
0,o6aT]GQw](,High-dimensional space issue,"Everything becomes close, even by chance, due ...",[ml],KaTeX and Markdown Basic (Color),Default,True,
1,N}iOTSdKs0,RL fine-tuning output,A sequence of text (or the probability distrib...,[llm],KaTeX and Markdown Basic (Color),Default,False,Missing newlines (<br> tags) in code block
2,s>XEwKc=B.,Darker roasted water temperature,On the lower end of the $195^{\circ}-205^{\\ci...,[espresso],KaTeX and Markdown Basic (Color),Default,False,Missing newlines (<br> tags) in code block
3,"N@8p,&7IfE",Show virtualenv path,```bash<br>$ pyenv which python<br>```,[pyenv],KaTeX and Markdown Basic (Color),Default,True,
4,i&S=]gfn{M,Return individual sample loss,"Setting `reduction=""none""` in BCELoss: ```pyth...",[pytorch],KaTeX and Markdown Basic (Color),Default,True,
5,"""jP(562:`#9""",fzf suffix-exact match,.mp3$,"[fzf, nvim]",KaTeX and Markdown Basic (Color),Default,False,Missing code block and newlines (<br> tags)
6,BI&~!0T(|),Summarization task,Generate short version of long text with relev...,[nlp],KaTeX and Markdown Basic (Color),Default,True,
7,N$uCAcCvbb,What is the purpose of the logarithmic function?,"""Converts inputs in the range $ [0, +\infty] $...",[dl],KaTeX and Markdown Basic (Color),Default,True,
8,zK~Wk7y4fv,Unrealized gains in Tax-Free account,Money grows tax-free,[finance],KaTeX and Markdown Basic (Color),Default,False,Missing newlines (<br> tags) in code block
9,q6*_Ig/A]n,Perplexity,Low perplexity,[ml],KaTeX and Markdown Basic (Color),Default,False,Missing newlines (<br> tags) in code block


In [18]:
def validate_interactive_session(session_text):
    lines = session_text.strip().split("<br>")
    input_pattern = r"^>>> .*$"
    continuation_pattern = r"^\.\.\. .*$"  # dots escaped so they match literally
    output_pattern = r"^(?!>>>)(?!\.\.\.)"

    state = "expecting_input"
    for i, line in enumerate(lines, 1):
        is_input = bool(
            re.match(input_pattern, line) or re.match(continuation_pattern, line)
        )
        if state == "expecting_input":
            if not is_input:
                return False, f"Line {i}: Expected input (>>> or ...), got: {line}"
            # After a command, any number of output or input lines may follow.
            state = "optional_output"
        elif not is_input and not re.match(output_pattern, line):
            return False, f"Line {i}: Invalid output format: {line}"

    return True, "Valid interactive session format"


def validate_code_block_format(block):
    # Check if the block starts and ends with ```
    if not (block.startswith("```") and block.endswith("```")):
        return False, "Code block should start and end with ```"

    # Remove the opening and closing ```
    content = block[3:-3].strip()

    # Check if the block starts with a language specifier
    if not re.match(r"^[\w-]+<br>", content):
        return (
            False,
            "Code block should start with a language specifier followed by <br>",
        )

    # Split the content by <br> tags
    lines = content.split("<br>")

    # Check if the last line is empty (as it should end with <br>)
    if lines[-1].strip() != "":
        return False, "Code block should end with <br>"

    # Check if there are any empty lines in between (which would indicate missing <br>)
    if any(line.strip() == "" for line in lines[1:-1]):
        return (
            False,
            "Code block should not have empty lines. Use <br> for line breaks.",
        )

    return True, "Valid code block format"


def validate_hybrid_markdown(content):
    issues = []

    # Check for double backslashes in LaTeX blocks
    latex_blocks = re.findall(r"\$(.*?)\$", content, re.DOTALL)
    for block in latex_blocks:
        if "\\\\" in block:
            issues.append(
                "Double backslash (\\\\) found in LaTeX block. This may cause rendering issues."
            )

    # Check for unmatched dollar signs
    # Split the content into code blocks and non-code blocks
    parts = re.split(r"(```[\s\S]*?```)", content)

    total_dollar_count = 0
    for part in parts:
        if part.startswith("```") and part.endswith("```"):
            # This is a code block
            is_valid, message = validate_code_block_format(part)
            if not is_valid:
                issues.append(f"Invalid code block format: {message}")

            if part.startswith("```python"):
                # Check if it's an interactive Python session
                session_content = part[13:-3].strip()  # Remove ```python<br> and ```
                is_valid, message = validate_interactive_session(session_content)
                if not is_valid:
                    issues.append(
                        f"Invalid Python interactive session in code block: {message}"
                    )
        else:
            # Count dollar signs in non-code block parts
            dollar_count = part.count("$")
            total_dollar_count += dollar_count

    # Check if the total number of dollar signs outside code blocks is odd
    if total_dollar_count % 2 != 0:
        issues.append(
            "Unmatched dollar signs outside code blocks. LaTeX may not render correctly."
        )

    # Check for common Markdown syntax errors
    if "```" in content and content.count("```") % 2 != 0:
        issues.append(
            "Unmatched code block delimiters (```). Code blocks may not render correctly."
        )

    return issues

In [19]:
n_reviews = 100

for _, note in x.iloc[:n_reviews].iterrows():
    print(f"Front: {note['front']}\nBack: {note['back']}\nTags: {note['tags']}")
    for side in ["front", "back"]:
        issues = validate_hybrid_markdown(note[side])
        if issues:
            for issue in issues:
                print(f"Issue {side}: {issue}")
        else:
            print(f"Issue {side}: None")
    print("\n")

Front: High-dimensional space issue
Back: Everything becomes close, even by chance, due to similar values on many coordinates.
Tags: ['ml']
Issue front: None
Issue back: None


Front: RL fine-tuning output
Back: A sequence of text (or the probability distributions over the text)
Tags: ['llm']
Issue front: None
Issue back: None


Front: Darker roasted water temperature
Back: On the lower end of the $195^{\circ}-205^{\\circ}\\[F]$ range
Tags: ['espresso']
Issue front: None
Issue back: Double backslash (\\) found in LaTeX block. This may cause rendering issues.


Front: Show virtualenv path
Back: ```bash<br>$ pyenv which python<br>```
Tags: ['pyenv']
Issue front: None
Issue back: None


Front: Return individual sample loss
Back: Setting `reduction="none"` in BCELoss: ```python<br>trgts = torch.tensor([1., 0. , 1. ])<br>prds  = torch.tensor([1., 0.4, 0.2])<br>l = torch.nn.BCELoss(reduction="none")<br>l(trgts, prds)<br>```
Tags: ['pytorch']
Issue front: None
Issue back: Invalid Python inter

Common errors are:

* Missing `<img>`
* Wrong prompt (e.g., `>>`, missing `$`)
* Missing `<br>` inside code block
* Missing `<br>` outside code block
* `\\` in LaTeX
* References (should we remove them?)
* Trailing `.` (full stop)
* Using code block for note that does not contain code
* "```bash" for keymap
* Missing language in code block
* Unmatched code block delimiter (missing trailing "```")
* Missing inline code block for keymap or short commands
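
Several of these errors are mechanical enough to catch without an LLM. The sketch below covers three of them (trailing full stop, `>>` used as a Python prompt, and a missing language in a code block). The function name `quick_checks` and the exact rules are illustrative assumptions, separate from the validators defined earlier.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to avoid nesting fences


def quick_checks(back: str) -> list[str]:
    """Return human-readable issues found on the back of a note."""
    issues = []
    # Split the text into code blocks and the prose around them.
    parts = re.split(f"({FENCE}.*?{FENCE})", back, flags=re.DOTALL)
    for part in parts:
        if part.startswith(FENCE) and part.endswith(FENCE) and len(part) >= 6:
            body = part[3:-3]
            if not re.match(r"[\w-]+<br>", body):
                issues.append("Missing language in code block")
            # A Python prompt has three '>' characters; two is a common typo.
            if any(re.match(r">>(?!>)", line) for line in body.split("<br>")):
                issues.append("Wrong prompt (`>>` instead of `>>>`)")
    prose = "".join(p for p in parts if not p.startswith(FENCE)).strip()
    if prose.endswith("."):
        issues.append("Trailing full stop")
    return issues


print(quick_checks(f"{FENCE}python<br>>> torch.dot(a, b)<br>tensor(7)<br>{FENCE}"))
print(quick_checks("Money grows tax-free."))
```

Cheap checks like these could run before the LLM judge, leaving the model to handle the cases that need judgment, such as whether a note contains code at all.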

### Todo

- [ ] Create a dataset to measure LLM judge's alignment with human preference 
- [ ] Use _reflection_ agentic workflow to improve notes