# LLM as a Judge

### Quickly review outputs of the current model to find good samples for an eval dataset

In [1]:
import json
import random
import re
from json.decoder import JSONDecodeError
from typing import cast

import pandas as pd
from pydantic import BaseModel
from tqdm import tqdm

from anki_ai.domain.model import Deck, Note
from anki_ai.service_layer.services import (
    ChatCompletionService,
    get_chat_completion,
)

Let's load the output of the LLM note editor.

In [2]:
deck = Deck("edited")
deck.read_txt("../data/new_deck.txt")

orig_deck = Deck("original")
orig_deck.read_txt("../data/Selected Notes v7.txt")

Let's create an LLM judge that flags the common errors we found during error analysis.

In [3]:
SYSTEM_MSG = r"""
Your job is to evaluate Anki notes, and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes should be in HTML format; for instance: newline should be "<br>", "<" should be "&lt;", etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for very short commands: `iw`, `d`, etc.

Examples of good notes:

Example 1:

    Front: Create soft link
    Back:  ```bash<br>$ ln -s <file> <link><br>```
    Tags:  ['linux']

Example 2:

    Front: Zip destination option
    Back:  ```bash<br>$ unzip <file> -d <path><br>```
    Tags:  ['linux']

Example 3:

    Front: Extract zip files
    Back:  ```bash<br>$ unzip <file><br>```
    Tags:  ['linux']

Example 4:

    Front: List directory content
    Back:  ```bash<br>$ ls <path><br>```
    Tags:  ['linux']

Examples of bad notes: 

Example 1:

    Front: Return to previous directory
    Back:  ```bash $ cd -```
    Tags:  ['linux']

    Reasoning: Missing newlines (<br> tags) in code block

Example 2: 

    Front: Remove delimiters
    Back:  ```ds <delimiter>```
    Tags:  ['nvim']

    Reasoning: Using triple backtick quotes without specifying the language and adding newlines (<br> tag) in code block

Example 3: 

    Front: Change Anki delimiters
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Mentioning the command is an Anki command when, in fact, it's a nvim command

Example 4: 

    Front: Text object for a sentence
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Missing command and not closing code block
"""

### Free text

In [4]:
def review_note(note: Note, chat: ChatCompletionService) -> None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
    )
    result: str = cast(str, chat_response.choices[0].message.content)

    print(user_msg)
    print(f"Eval: {result}\n")
    print("#######################\n")

In [5]:
chat = get_chat_completion()
for note in deck[:10]:
    review_note(note, chat)

Front: Headboard
Back: Headboard
Tags: ['english']
Eval: This note is not formatted correctly.

Reasoning: The front and back of the note should contain different information. The front should be a question or a prompt, and the back should be the answer or the information to be remembered. In this case, both the front and back contain the same information, which is "Headboard". 

Additionally, the note is missing a code block or any other relevant information that would make it useful for memorization. 

Corrected note:

Front: What is a headboard?
Back:  ```html<br>A headboard is a piece of furniture that is placed at the head of a bed.<br>```
Tags:  ['english']

#######################

Front: 
Back: Towel
Tags: ['english']
Eval: This note is not formatted correctly.

Reasoning: 
- The front and back of the note should be in HTML format, with the front being a question or a prompt and the back being the answer or the information to be remembered.
- The back of the note is a single wo

### Structured output

In [6]:
class Review(BaseModel):
    guid: str
    is_correct: bool
    reasoning: str


def review_note(
    note: Note, chat: ChatCompletionService, verbose: bool = False
) -> Review | None:
    user_msg = f"""Front: {note.front}\nBack: {note.back}\nTags: {note.tags}"""

    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": user_msg},
    ]
    extra_body = {
        "guided_json": Review.model_json_schema(),
        "guided_whitespace_pattern": r"[\n\t ]*",
    }

    chat_response = chat.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,  # type: ignore
        temperature=0,
        extra_body=extra_body,
    )
    content_str: str = cast(str, chat_response.choices[0].message.content)
    try:
        content_dict = json.loads(content_str)
    except JSONDecodeError as e:
        # The model returned malformed JSON: report it and skip this note.
        print(e)
        return None

    # Replace whatever guid the model produced with the note's real GUID.
    content_dict["guid"] = note.guid
    result = Review.model_validate(content_dict)

    if verbose:
        print(user_msg)
        print(f"Eval: {result}\n")
        print("#######################\n")

    return result
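
The parsing step can be exercised without calling a model. The sketch below is a standard-library approximation of it, not the notebook's implementation: `ParsedReview` and `parse_review` are hypothetical names, the dataclass stands in for the pydantic `Review` model, and the raw reply strings are made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedReview:  # hypothetical stand-in for the pydantic `Review` model
    guid: str
    is_correct: bool
    reasoning: str


def parse_review(raw: str, guid: str) -> Optional[ParsedReview]:
    """Parse the judge's raw reply; return None if it is not usable JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("is_correct"), bool):
        return None
    # The GUID always comes from the note, never from the model.
    return ParsedReview(
        guid=guid,
        is_correct=data["is_correct"],
        reasoning=str(data.get("reasoning", "")),
    )


# Made-up replies: one complete, one cut off mid-token.
ok = parse_review(
    '{"guid": "made-up", "is_correct": false, "reasoning": "Missing <br>"}', "abc123"
)
truncated = parse_review('{"guid": "made-up", "is_correct": fal', "abc123")
print(ok)
print(truncated)
```

Returning `None` for an unusable reply means the caller has to decide explicitly whether to skip or retry that note.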

In [7]:
chat = get_chat_completion()
n = 200
results = []

notes = list(deck)
random.shuffle(notes)
for note in tqdm(notes[:n]):
    result = review_note(note, chat)
    if result is None:
        # The review could not be parsed as JSON; skip this note.
        continue
    results.append(result)

correct_cnt = sum(r.is_correct for r in results)
print(f"{correct_cnt / len(results):.2%} correct")

100%|██████████████████████████████████████████████| 200/200 [02:57<00:00,  1.13it/s]

53.50% correct
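
The cell above shuffles with the global random state, so each rerun reviews a different set of 200 notes. If the sample needs to be reproducible, for example to compare two judge prompts on the same notes, one option is a seeded local generator. This is a minimal sketch under that assumption; `sample_items` and the placeholder identifiers are hypothetical and not part of the notebook.

```python
import random


def sample_items(items, n, seed=0):
    """Return up to `n` items chosen without replacement, reproducibly."""
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    items = list(items)
    return rng.sample(items, min(n, len(items)))


guids = [f"note-{i}" for i in range(1000)]  # placeholder stand-ins for deck notes
first = sample_items(guids, 200, seed=42)
second = sample_items(guids, 200, seed=42)
print(first == second, len(first))
```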





Let's create a `pandas.DataFrame` that joins each note with its LLM review, which makes the reviews easier to inspect.

In [8]:
dict_data = [item.model_dump() for item in results]
df_scores = pd.DataFrame(dict_data)
df_scores.head()

Unnamed: 0,guid,is_correct,reasoning
0,BAFW^G^U)m,True,
1,qGYEIXY+>$,False,Missing newlines (<br> tags) in code block
2,"""or8JXq#/99""",True,
3,C>f._kPkT`,False,Missing newlines (<br> tags) in code block
4,"""w7=$k#?Ufv""",False,Missing newlines (<br> tags) in code block and...


In [9]:
a = [note.dict() for note in deck]
df_notes = pd.DataFrame(a)
df_notes.head()

Unnamed: 0,guid,front,back,tags,notetype,deck_name
0,D?H@y-%%r,Headboard,Headboard,[english],KaTeX and Markdown Basic (Color),Default
1,IjfKk}wnb@,,Towel,[english],KaTeX and Markdown Basic (Color),Default
2,"""G1Z_~#;mLc""",,Jug,[english],KaTeX and Markdown Basic (Color),Default
3,"Azd65{j+,q",Create soft link,```bash<br>$ ln -s <file> <link><br>```,[linux],KaTeX and Markdown Basic (Color),Default
4,BGL!8$wV<W,`ln -s` argument order,"File name, then link name",[linux],KaTeX and Markdown Basic (Color),Default


In [10]:
x = pd.merge(df_notes, df_scores, how="inner", on="guid")
x = x[x.tags.apply(lambda a: "life" not in a)]  # exclude personal notes
print(x.shape)
x.head(25)

(198, 8)


Unnamed: 0,guid,front,back,tags,notetype,deck_name,is_correct,reasoning
0,IjfKk}wnb@,,Towel,[english],KaTeX and Markdown Basic (Color),Default,False,Missing front and back content
1,"s=l*N,i*FW",Get command manual/help,```bash<br>$ man <command><br>```,[linux],KaTeX and Markdown Basic (Color),Default,True,
2,"""wa*:15PL#(""",Move/Rename file/dir,```bash<br>$ mv <file/path> <new/file/path><br...,[linux],KaTeX and Markdown Basic (Color),Default,True,
3,EvRnWdyrV6,Terminate stalled processes,```bash<br>$ kill <command><br>```,[linux],KaTeX and Markdown Basic (Color),Default,True,
4,"O`qIqf,Pdf",Return to previous directory,```bash<br>$ cd -<br>```,[linux],KaTeX and Markdown Basic (Color),Default,True,
5,dvE<}!LiBO,L2-norm other name,Euclidean distance,[math],KaTeX and Markdown Basic (Color),Default,True,
6,ouLFegt+x`,L1-norm name,Manhattan distance,[math],KaTeX and Markdown Basic (Color),Default,True,
7,t:e9Hn9Uxa,L1-norm formula,$\|\boldsymbol{x}\|_1 = \sum_{i=1}^n \left|x_i...,[math],KaTeX and Markdown Basic (Color),Default,True,
8,q[Gg^irw5-,Goal of boosting,Reduce bias and variance,[ml],KaTeX and Markdown Basic (Color),Default,True,
9,AQ{]$7/rl[,Parallel boosting,No,[ml],KaTeX and Markdown Basic (Color),Default,False,Missing code block and newlines (<br> tags)


In [11]:
def validate_interactive_session(session_text):
    lines = session_text.strip().split("<br>")
    input_pattern = r"^>>> .*$"
    # Escape the dots so the pattern matches a literal "..." prompt.
    continuation_pattern = r"^\.\.\. .*$"
    output_pattern = r"^(?!>>>)(?!\.\.\.)"

    state = "expecting_input"
    for i, line in enumerate(lines, 1):
        is_input = bool(
            re.match(input_pattern, line) or re.match(continuation_pattern, line)
        )
        if state == "expecting_input":
            if not is_input:
                return False, f"Line {i}: Expected input (>>> or ...), got: {line}"
            state = "optional_output"
        elif not is_input and not re.match(output_pattern, line):
            # After the first input, each line may be either input or output.
            return False, f"Line {i}: Invalid output format: {line}"

    return True, "Valid interactive session format"


def validate_code_block_format(block):
    # Check if the block starts and ends with ```
    if not (block.startswith("```") and block.endswith("```")):
        return False, "Code block should start and end with ```"

    # Remove the opening and closing ```
    content = block[3:-3].strip()

    # Check if the block starts with a language specifier
    if not re.match(r"^[\w-]+<br>", content):
        return (
            False,
            "Code block should start with a language specifier followed by <br>",
        )

    # Split the content by <br> tags
    lines = content.split("<br>")

    # Check if the last line is empty (as it should end with <br>)
    if lines[-1].strip() != "":
        return False, "Code block should end with <br>"

    # Check for empty lines in between (these come from consecutive <br> tags)
    if any(line.strip() == "" for line in lines[1:-1]):
        return (
            False,
            "Code block should not have empty lines. Use <br> for line breaks.",
        )

    return True, "Valid code block format"


def validate_hybrid_markdown(content):
    issues = []

    # Check for double backslashes in LaTeX blocks
    latex_blocks = re.findall(r"\$(.*?)\$", content, re.DOTALL)
    for block in latex_blocks:
        if "\\\\" in block:
            issues.append(
                "Double backslash (\\\\) found in LaTeX block. This may cause rendering issues."
            )

    # Check for unmatched dollar signs
    # Split the content into code blocks and non-code blocks
    parts = re.split(r"(```[\s\S]*?```)", content)

    total_dollar_count = 0
    for part in parts:
        if part.startswith("```") and part.endswith("```"):
            # This is a code block
            is_valid, message = validate_code_block_format(part)
            if not is_valid:
                issues.append(f"Invalid code block format: {message}")

            if part.startswith("```python"):
                # Check if it's an interactive Python session
                session_content = part[13:-3].strip()  # Remove ```python<br> and ```
                is_valid, message = validate_interactive_session(session_content)
                if not is_valid:
                    issues.append(
                        f"Invalid Python interactive session in code block: {message}"
                    )
        else:
            # Count dollar signs in non-code block parts
            dollar_count = part.count("$")
            total_dollar_count += dollar_count

    # Check if the total number of dollar signs outside code blocks is odd
    if total_dollar_count % 2 != 0:
        issues.append(
            "Unmatched dollar signs outside code blocks. LaTeX may not render correctly."
        )

    # Check for common Markdown syntax errors
    if "```" in content and content.count("```") % 2 != 0:
        issues.append(
            "Unmatched code block delimiters (```). Code blocks may not render correctly."
        )

    return issues
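
The code-block convention these validators enforce, and that the judge prompt describes as a language, then `<br>`, then the command lines each ending in `<br>`, can also be approximated with a single regular expression. The sketch below assumes that reading of the rule; `CODE_BLOCK_RE` and `follows_code_block_rule` are hypothetical names, and the sample strings are taken from the good and bad examples in the system prompt.

```python
import re

# Assumed reading of the rule: ```<language><br><one or more lines, each ending in <br>>```
CODE_BLOCK_RE = re.compile(r"^```[\w-]+<br>(?:.+?<br>)+```$")


def follows_code_block_rule(back: str) -> bool:
    """Return True if `back` is a single well-formed code block."""
    return CODE_BLOCK_RE.match(back) is not None


good = "```bash<br>$ ln -s <file> <link><br>```"
bad_no_newlines = "```bash $ cd -```"
bad_no_language = "```ds <delimiter>```"

for back in (good, bad_no_newlines, bad_no_language):
    print(f"{follows_code_block_rule(back)!s:5} {back}")
```

A deterministic check like this gives a cheap baseline to compare the judge's `is_correct` labels against for notes that consist of a single code block.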

In [12]:
n_reviews = 100

for _, note in x.iloc[:n_reviews].iterrows():
    print(f"Front: {note['front']}\nBack: {note['back']}\nTags: {note['tags']}")
    for side in ["front", "back"]:
        issues = validate_hybrid_markdown(note[side])
        if issues:
            for issue in issues:
                print(f"Issue {side}: {issue}")
        else:
            print(f"Issue {side}: None")
    print("\n")

Front: 
Back: Towel
Tags: ['english']
Issue front: None
Issue back: None


Front: Get command manual/help
Back: ```bash<br>$ man <command><br>```
Tags: ['linux']
Issue front: None
Issue back: None


Front: Move/Rename file/dir
Back: ```bash<br>$ mv <file/path> <new/file/path><br>```
Tags: ['linux']
Issue front: None
Issue back: None


Front: Terminate stalled processes
Back: ```bash<br>$ kill <command><br>```
Tags: ['linux']
Issue front: None
Issue back: None


Front: Return to previous directory
Back: ```bash<br>$ cd -<br>```
Tags: ['linux']
Issue front: None
Issue back: None


Front: L2-norm other name
Back: Euclidean distance
Tags: ['math']
Issue front: None
Issue back: None


Front: L1-norm name
Back: Manhattan distance
Tags: ['math']
Issue front: None
Issue back: None


Front: L1-norm formula
Back: $\|\boldsymbol{x}\|_1 = \sum_{i=1}^n \left|x_i\right|$
Tags: ['math']
Issue front: None
Issue back: None


Front: Goal of boosting
Back: Reduce bias and variance
Tags: ['ml']
Issue fron

Common errors are:

* Missing `<img>`
* Wrong prompt (e.g., `>>`, missing `$`)
* Missing `<br>` inside code block
* Missing `<br>` outside code block
* `\\` in LaTeX
* References (should we remove them?)
* Trailing `.` (full stop)
* Using code block for note that does not contain code
* "```bash" for keymap
* Missing language in code block
* Unmatched code block delimiter (missing trailing "```")
* Missing inline code block for keymap or short commands
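
Several items in this list are mechanical enough to flag with simple heuristics before involving a judge. The following is a rough sketch covering only a handful of them (unmatched delimiters, missing language, wrong `>>` prompt, `\\` in LaTeX, trailing full stop); `quick_checks` is a hypothetical helper, and the rules are assumptions about what each list item means rather than the notebook's definitions.

```python
import re


def quick_checks(text: str) -> list[str]:
    """Flag a few of the errors listed above (heuristics, not a full validator)."""
    issues = []
    if text.count("```") % 2 != 0:
        issues.append("unmatched code block delimiter")
    for body in re.findall(r"```(.*?)```", text, flags=re.DOTALL):
        if not re.match(r"[\w-]+<br>", body):
            issues.append("missing language in code block")
        # A Python prompt must be exactly ">>> "; ">> " or ">>>x" is flagged.
        if any(re.match(r">>(?!> )", line) for line in body.split("<br>")):
            issues.append("wrong prompt")
    # Drop fenced blocks so the remaining checks only look at prose and LaTeX.
    prose = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    latex_blocks = re.findall(r"\$(.*?)\$", prose, flags=re.DOTALL)
    if any("\\\\" in block for block in latex_blocks):
        issues.append(r"\\ in LaTeX")
    if prose.rstrip().endswith("."):
        issues.append("trailing full stop")
    return issues


print(quick_checks("```python<br>>> len(x)<br>```"))
print(quick_checks("Reduce bias and variance."))
print(quick_checks(r"$a \\ b$"))
print(quick_checks("```bash<br>$ cd -<br>```"))
```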

### Todo

- [ ] Create a dataset to measure LLM judge's alignment with human preference 
- [ ] Use _reflection_ agentic workflow to improve notes
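
For the first item, one possible way to quantify alignment once human labels exist is raw agreement together with Cohen's kappa, which corrects for agreement expected by chance. This is a sketch of that one option, not a decision the notebook has made; `agreement_and_kappa` is a hypothetical helper and the label lists are made-up illustration data.

```python
def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two lists of boolean labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal rate of "correct" labels.
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when chance agreement is 1
    return observed, (observed - expected) / (1 - expected)


# Made-up labels: True means "note is formatted correctly".
human = [True, True, False, False, True, False, True, True]
judge = [True, False, False, False, True, True, True, True]
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```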