# LLM as a Judge

So far, we have been manually reviewing the LLM editor's outputs. This has been a relatively smooth process, but it is not scalable, as there are many failure cases we would need to keep track of. Investing in building an LLM judge makes sense at this stage. 

Before deploying an LLM judge, we need to ensure its verdicts align with those of a human judge. This is critical: otherwise, we risk optimizing the wrong metric.

Let's get started by creating a small human-annotated dataset of reviews. This dataset will later be used to evaluate the performance of our LLM judge. 

### Create an Eval dataset

To ease the process of creating an eval dataset, we built a small utility class, `ReviewApp`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import re
from json.decoder import JSONDecodeError
from typing import cast

import pandas as pd
from pydantic import BaseModel

from anki_ai.domain.model import Deck
from anki_ai.entrypoints.review_notes_changes import ReviewApp
from anki_ai.service_layer.services import (
    ChatCompletionsService,
    get_chat_completion,
)

We have collected annotations for over 200 notes. We will use this dataset to evaluate how well the judge aligns with our preferences.

In [3]:
orig_deck = Deck("original")
orig_deck.read_txt("../data/Selected Notes v7.txt")

deck = Deck("edited")
deck.read_txt("../data/new_deck.txt")

In [4]:
ra = ReviewApp(old_deck=orig_deck, new_deck=deck)
ra.load("../data/eval.txt")

### Create a very simple LLM judge

Let's create a simple LLM judge, and evaluate its alignment with human preference by measuring how well it does on the eval dataset. 
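Raw agreement alone can hide a lazy judge: if most notes are labelled True, a judge that approves everything still scores well. The sketch below is one possible way to report agreement together with precision and recall for the badly formatted class. `alignment_report` and the toy labels are hypothetical and not part of `anki_ai`.

```python
from collections import Counter
from typing import Dict, List


def alignment_report(human: List[bool], judge: List[bool]) -> Dict[str, float]:
    """Compare judge verdicts with human labels.

    False (badly formatted) is treated as the positive class, since
    those are the notes we want the judge to catch.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    counts = Counter(zip(human, judge))
    tp = counts[(False, False)]  # bad note, flagged as bad
    fn = counts[(False, True)]   # bad note, missed by the judge
    fp = counts[(True, False)]   # good note, wrongly flagged
    tn = counts[(True, True)]    # good note, approved
    total = len(human)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision_bad": tp / (tp + fp) if tp + fp else 0.0,
        "recall_bad": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels (made up): a judge that approves everything
human = [True, True, True, False]
always_true = [True, True, True, True]
print(alignment_report(human, always_true))
```

In this toy case the agreement is 75%, yet the judge catches none of the bad notes, which is why recall on the False class is worth tracking next to the alignment score.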

In [5]:
SYSTEM_MSG_V1 = r"""
Your job is to evaluate Anki's notes and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes are written in hybrid markdown; for instance: the newline character is `<br>,` `<` is `&lt;`, etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for short commands: e.g., `iw`, `d`, etc.

Provide only a boolean score: False for bad and True for good.
"""

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

from collections import namedtuple

from jinja2 import Template


class LLMJudge:
    def __init__(
        self,
        chat: ChatCompletionsService,
        system_msg: str,
        user_msg_tmpl: str,
        model_name: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    ) -> None:
        self.chat = chat
        self.system_msg = system_msg
        self.user_msg_tmpl = Template(user_msg_tmpl)
        self.model_name = model_name

    def review(self, note) -> "Verdict":
        user_msg = self.user_msg_tmpl.render(note=note)
        messages = [
            {"role": "system", "content": self.system_msg},
            {"role": "user", "content": user_msg},
        ]
        chat_response = self.chat.create(
            model=self.model_name,
            messages=messages,  # type: ignore
            temperature=0,
        )
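        result: str = cast(str, chat_response.choices[0].message.content)
        # eval() executes whatever the model returns, so this is only acceptable
        # in a throwaway notebook. Replies that are not valid Python raise
        # SyntaxError, which review_notes catches.
        return Verdict(is_correct=eval(result))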


user_msg_tmpl = """Front: {{ note.front }}
Back: {{ note.back }}
Tags: {{ note.tags }}
"""

chat = get_chat_completion()
judge = LLMJudge(chat=chat, system_msg=SYSTEM_MSG_V1, user_msg_tmpl=user_msg_tmpl)
Verdict = namedtuple("Verdict", ["is_correct"])


def review_notes(ra, judge):
    aligned = 0
    tot = 0
    for guid, score in ra._ReviewApp__reviews.items():
        note = deck.get(guid=guid)[0]
        try:
            verdict = judge.review(note)
            if verdict.is_correct == eval(score):
                aligned += 1
            tot += 1
            print(f"Ground Truth: {eval(score)}\nVerdict: {verdict}")
        except SyntaxError as e:
            print(
                f"The LLM did not comply with the prompt and returned something different from True or False: {e}"
            )
        finally:
            # Print the separator whether or not the verdict could be parsed
            print("#######################")
    if tot:
        print(f"Alignment: {aligned}/{tot} ({aligned / tot:.2%})")
    else:
        print("Alignment: no verdict could be parsed")


review_notes(ra=ra, judge=judge)

Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
The LLM did not comply with the prompt and returned something different from True or False: invalid syntax (<string>, line 3)
#######################
Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
Ground Truth: False
Verdict: Verdict(is_correct=True)
#######################
Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
Ground Truth: False
Verdict: Verdict(is_correct=False)
#######################
The LLM did not comply with the prompt and returned something different from True or False: invalid syntax (<string>, line 3)
#######################
Ground Truth: True
Verdict: Verdict(is_correct=True)
#######################
The LLM did not comply with the prompt and returned something different from True

This first judge is not bad: when it answers, it mostly agrees with the human labels. However, it frequently ignores the instructions and returns something other than a boolean. Let's fix that with structured output, then try to improve performance with some prompt engineering.
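Before reaching for structured output, a cheap stopgap is to parse the reply defensively instead of passing it to `eval`. The following is a minimal sketch, assuming the reply is free text that should contain exactly one of "True" or "False"; `parse_verdict` is a hypothetical helper, not something the notebook defines.

```python
import re
from typing import Optional

# Match the words true/false on their own, ignoring case
_VERDICT_RE = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def parse_verdict(reply: str) -> Optional[bool]:
    """Return the boolean found in the reply, or None if it is missing or ambiguous."""
    found = {match.lower() for match in _VERDICT_RE.findall(reply)}
    if len(found) != 1:
        # Either no verdict at all, or both True and False are mentioned
        return None
    return found.pop() == "true"


print(parse_verdict("True"))
print(parse_verdict("The note looks fine to me."))
```

Returning `None` for non-compliant replies lets the caller count them separately rather than crash or silently guess.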

### Improve performance of LLM judge

#### Structured output

The LLM judge's performance seems decent, but it frequently fails to follow the instructions and returns something other than a boolean. Structured output constrains the reply to a JSON schema, which makes the verdicts easier to parse and removes this failure mode.
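To make explicit what the validation step checks, here is a library-free sketch of parsing a JSON verdict. The notebook itself relies on pydantic for this; `ReviewResult` and `parse_review` are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ReviewResult:
    guid: str
    is_correct: bool


def parse_review(raw: str, guid: str) -> ReviewResult:
    """Parse a JSON reply and insist that `is_correct` is a real JSON boolean."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("is_correct")
    if not isinstance(verdict, bool):
        raise ValueError(f"'is_correct' must be a JSON boolean, got {verdict!r}")
    # The guid comes from the note, not from the model
    return ReviewResult(guid=guid, is_correct=verdict)


print(parse_review('{"is_correct": true}', guid="abc"))
```

Taking the guid from the note rather than from the reply mirrors what the cell below does when it overwrites `content_dict["guid"]`.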

In [6]:
class Review(BaseModel):
    guid: str
    is_correct: bool


class LLMJudgeJSON:
    def __init__(
        self,
        chat: ChatCompletionsService,
        system_msg: str,
        user_msg_tmpl: str,
        model_name: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
        review_model=Review,
    ) -> None:
        self.chat = chat
        self.system_msg = system_msg
        self.user_msg_tmpl = Template(user_msg_tmpl)
        self.model_name = model_name
        self.review_model = review_model

    def review(self, note) -> "BaseModel | None":
        user_msg = self.user_msg_tmpl.render(note=note)
        messages = [
            {"role": "system", "content": self.system_msg},
            {"role": "user", "content": user_msg},
        ]
        extra_body = {
            "guided_json": self.review_model.model_json_schema(),
            "guided_whitespace_pattern": r"[\n\t ]*",
        }

        chat_response = self.chat.create(
            model=self.model_name,
            messages=messages,  # type: ignore
            temperature=0,
            extra_body=extra_body,
        )
        content_str: str = cast(str, chat_response.choices[0].message.content)
        try:
            content_dict = json.loads(content_str)
            content_dict["guid"] = note.guid
            updated_content_str = json.dumps(content_dict)
            result = self.review_model.model_validate_json(updated_content_str)
            return result
        except JSONDecodeError as e:
            print(e)


judge_json = LLMJudgeJSON(
    chat=chat, system_msg=SYSTEM_MSG_V1, user_msg_tmpl=user_msg_tmpl
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' is_correct=True
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' is_correct=True
#######################
Ground Truth: False
Verdict: guid='I@*6RLEsm]' is_correct=True
#######################
Ground Truth: True
Verdict: guid='v5I1<L+^4k' is_correct=True
#######################
Ground Truth: False
Verdict: guid='if[&q~T8?V' is_correct=True
#######################
Ground Truth: True
Verdict: guid='L~{D]VMy.2' is_correct=True
#######################
Ground Truth: False
Verdict: guid='HhbnT=2&rE' is_correct=True
#######################
Ground Truth: False
Verdict: guid='"fmc=2!5#q6"' is_correct=True
#######################
Ground Truth: True
Verdict: guid='gy}T)rqdHN' is_correct=True
#######################
Ground Truth: True
Verdict: guid='w`zo)Q_qy+' is_correct=True
#######################
Ground Truth: True
Verdict: guid='IT@~i{IUdV' 

#### Few-shot prompting

In the output above, the judge approves every note shown, so every note with a ground truth of False is misclassified. Let's pass a few examples to the LLM judge to see if that helps.

In [7]:
SYSTEM_MSG_V2 = r"""
Your job is to evaluate Anki notes, and classify notes that are not formatted correctly.

Requirements:
* Only check formatting
* Notes should be in HTML format; for instance: newline should "<br>", "<" should be "&lt;", etc.
* Preserve images and media on the original note
* Use code block: ```<language><br><command><br>```
* Use inline code format for very short commands: `iw`, `d`, etc.

Examples of good notes:

Example 1:

    Front: Create soft link
    Back:  ```bash<br>$ ln -s <file> <link><br>```
    Tags:  ['linux']

Example 2:

    Front: Zip destination option
    Back:  ```bash<br>$ unzip <file> -d <path><br>```
    Tags:  ['linux']

Example 3:

    Front: Extract zip files
    Back:  ```bash<br>$ unzip <file><br>```
    Tags:  ['linux']

Example 4:

    Front: List directory content
    Back:  ```bash<br>$ ls <path><br>```
    Tags:  ['linux']

Examples of bad notes: 

Example 1:

    Front: Return to previous directory
    Back:  ```bash $ cd -```
    Tags:  ['linux']

    Reasoning: Missing newlines (<br> tags) in code block

Example 2: 

    Front: Remove delimiters
    Back:  ```ds <delimiter>```
    Tags:  ['nvim']

    Reasoning: Using triple backtick quotes without specifying the language and adding newlines (<br> tag) in code block

Example 3: 

    Front: Change Anki delimiters
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Mentioning the command is an Anki command when, in fact, it's a nvim command

Example 4: 

    Front: Text object for a sentence
    Back:  ```\
    Tags:  ['nvim']
    
    Reasoning: Missing command and not closing code block
"""

judge_json = LLMJudgeJSON(
    chat=chat, system_msg=SYSTEM_MSG_V2, user_msg_tmpl=user_msg_tmpl
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' is_correct=True
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' is_correct=True
#######################
Ground Truth: False
Verdict: guid='I@*6RLEsm]' is_correct=True
#######################
Ground Truth: True
Verdict: guid='v5I1<L+^4k' is_correct=True
#######################
Ground Truth: False
Verdict: guid='if[&q~T8?V' is_correct=True
#######################
Ground Truth: True
Verdict: guid='L~{D]VMy.2' is_correct=True
#######################
Ground Truth: False
Verdict: guid='HhbnT=2&rE' is_correct=True
#######################
Ground Truth: False
Verdict: guid='"fmc=2!5#q6"' is_correct=True
#######################
Ground Truth: True
Verdict: guid='gy}T)rqdHN' is_correct=True
#######################
Ground Truth: True
Verdict: guid='w`zo)Q_qy+' is_correct=False
#######################
Ground Truth: True
Verdict: guid='IT@~i{IUdV'

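This result is surprising: the verdicts are almost unchanged, and one note that was previously judged correctly is now misclassified. We would have expected a few examples to help the model understand the expected formatting for these notes.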
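Hand-written examples inside a long prompt string are tedious to swap when experimenting. One option is to keep them as small records and render the few-shot section from them. This is only a sketch: `Example` and `render_examples` are hypothetical, and the layout simply copies the one used in `SYSTEM_MSG_V2`.

```python
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # built at runtime so this snippet contains no literal code fence


@dataclass
class Example:
    front: str
    back: str
    tags: List[str]
    reasoning: str = ""  # only filled in for bad examples


def render_examples(heading: str, examples: List[Example]) -> str:
    """Render labelled notes in the same layout as the examples in SYSTEM_MSG_V2."""
    lines = [heading, ""]
    for i, ex in enumerate(examples, 1):
        lines += [
            f"Example {i}:",
            "",
            f"    Front: {ex.front}",
            f"    Back:  {ex.back}",
            f"    Tags:  {ex.tags}",
        ]
        if ex.reasoning:
            lines += ["", f"    Reasoning: {ex.reasoning}"]
        lines.append("")
    return "\n".join(lines)


good = [Example("List directory content", f"{FENCE}bash<br>$ ls <path><br>{FENCE}", ["linux"])]
bad = [
    Example(
        "Return to previous directory",
        f"{FENCE}bash $ cd -{FENCE}",
        ["linux"],
        "Missing newlines (<br> tags) in code block",
    )
]
print(render_examples("Examples of good notes:", good))
print(render_examples("Examples of bad notes:", bad))
```

With the examples held in lists, it becomes easy to test whether a different selection or a different ratio of good to bad notes changes the judge's behaviour.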

#### Reasoning

We would expect performance to improve if the LLM judge provides some reasoning for its decision before submitting a verdict. Let's see if that works.

In [8]:
SYSTEM_MSG_V3 = r"""
Your job is to evaluate the formatting of Anki note.

Properly formatted notes should:
* Use hybrid markdown format. For instance, use "<br>" to signal a new line, "&lt;" for "<" symbol, etc.
* Preserve images and media on the original note
* Wrap code in a code block: ```<language><br><command><br>```
* Wrap math in a LaTeX block: $ <math equation> $. Also, ensure that we do not use double backslashes, \\, in a LaTeX block, as that won't be correctly displayed
* Wrap short commands in an inline code block: `iw`, `d`, etc.

Provide concise reasoning for your answer and a True/False answer, where True means the note is formatted correctly and False means the note is not properly formatted.
"""


class Review2(BaseModel):
    guid: str
    reasoning: str
    is_correct: bool


judge_json = LLMJudgeJSON(
    chat=chat,
    system_msg=SYSTEM_MSG_V3,
    user_msg_tmpl=user_msg_tmpl,
    review_model=Review2,
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' reasoning='The note is properly formatted. It uses hybrid markdown format, preserves the original code, and wraps the code in a code block.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' reasoning='The note is properly formatted as it uses the required tags and does not contain any code, math, or media that would require special formatting.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' reasoning='The note is properly formatted as it uses hybrid markdown format, preserves the image, and uses code blocks for commands.' is_correct=True
#######################
Ground Truth: False
Verdict: guid='I@*6RLEsm]' reasoning="This note is properly formatted as it uses the required hybrid markdown format, preserves the original note's formatting, and uses the correct formatting for code and math." is_correct=True
#######################
Ground Truth: True
Verdict: guid='v5

#### Provide original notes

The LLM judge made some mistakes due to not having access to the original note. For instance, the LLM judge would not know if the LLM editor removed an image or block of code present in the original note. Let's try to address that.
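As far as the cells shown here go, `user_msg_tmpl` still renders a single note, so the prompts below describe two notes while the user message contains only one. A minimal sketch of a user message that includes both versions could look like this. The `Note` dataclass is a hypothetical stand-in for the project's note object, and the layout follows the input format spelled out in `SYSTEM_MSG_V5`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    """Hypothetical stand-in for the project's note object (front, back, tags)."""

    front: str
    back: str
    tags: List[str]


def _render(title: str, note: Note) -> str:
    return f"{title}:\nFront: {note.front}\nBack: {note.back}\nTags: {note.tags}"


def render_note_pair(original: Note, improved: Note) -> str:
    """Build a user message that shows the judge both versions of a note."""
    return _render("Original note", original) + "\n\n" + _render("Improved note", improved) + "\n"


# Made-up notes, for illustration only
original = Note("How to list files?", "Use ls to list the files in a directory", ["linux"])
improved = Note("List directory content", "`ls`", ["linux"])
print(render_note_pair(original, improved))
```

Wiring this in would also mean passing the original note to `review`, for example by looking it up in `orig_deck` by guid. That lookup is an assumption about `Deck`, mirroring the `deck.get(guid=guid)` call used earlier.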

In [9]:
SYSTEM_MSG_V4 = r"""
The user will share two Anki notes: the original and improved versions. The improved version should be factually the same as the original note but more concise and might have a slightly different format.

Your job is to evaluate the formatting of the improved version. Properly formatted notes should:
* Use hybrid markdown format. For instance, when relevant, use `<br>` to signal a new line, `&nbsp;` to signal a non-breaking space, `&lt;` for `<` symbol, etc. Cards with just one sentence that do not include code or math equations do not require special formatting
* Preserve images and media present on the original note
* Code should be wrapped in a code block: ```<language><br><command><br>```. One line command should use an inline code block: `iw`, `d`, `:copen`, etc.
* Mathematical equations should be wrapped in a LaTeX block: $ <math equation> $. Also, ensure that we do not use double backslashes, \\, in a LaTeX block, as that won't be correctly displayed

Provide concise reasoning (no more than two sentences) for your answer and a True/False answer, where True means the improved note is formatted correctly and False means the improved note is not properly formatted.
"""

judge_json = LLMJudgeJSON(
    chat=chat,
    system_msg=SYSTEM_MSG_V4,
    user_msg_tmpl=user_msg_tmpl,
    review_model=Review2,
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' reasoning='The improved note uses hybrid markdown format for the code block, but it should be wrapped in a code block with a language specified, such as `bash`. The link should be on a new line.' is_correct=False
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' reasoning='The improved note is not properly formatted because it does not use the hybrid markdown format. It should be wrapped in a code block or use inline code for the command.' is_correct=False
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' reasoning='The improved note is not properly formatted because it does not use the hybrid markdown format. The note should be wrapped in a code block or use inline code for commands.' is_correct=False
#######################
Ground Truth: False
Verdict: guid='I@*6RLEsm]' reasoning='The improved note is not properly formatted because it does not use the hybrid markdown format. The note should be wra

In [10]:
SYSTEM_MSG_V5 = r"""
The user will share two Anki notes: the original and improved versions. Here is an example of the input:

Original note:
Front: <original front>
Back: <original back>
Tags: <original tags>

Improved note:
Front: <improved front>
Back: <improved back>
Tags: <improved tags>

The improved version should be factually the same as the original note but more concise and might have a slightly different format.

Evaluate the formatting of the improved note, both front and back cards. Properly formatted notes should:
* Use hybrid markdown format. For instance, when relevant, use `<br>` to signal a new line, `&lt;` for `<` symbol, etc. Cards with just one sentence that do not include code or math equations do not require special formatting
* Preserve images and media present on the original note
* Code should be wrapped in a code block: ```<language><br><command><br>```. One line command should use an inline code block: `iw`, `d`, `:copen`, etc.
* Mathematical equations should be wrapped in a LaTeX block: $ <math equation> $. Also, ensure that we do not use double backslashes, \\, in a LaTeX block, as that won't be correctly displayed

The original note is provided only as a reference to ensure we are preserving the intention of the note and any media/code example.

Provide concise reasoning for your answer and a True/False answer, where True means the improved note is formatted correctly and False means the improved note is not properly formatted.
"""

judge_json = LLMJudgeJSON(
    chat=chat,
    system_msg=SYSTEM_MSG_V5,
    user_msg_tmpl=user_msg_tmpl,
    review_model=Review2,
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' reasoning='The improved note is properly formatted. It uses hybrid markdown format, preserves the original command, and wraps the code in a code block.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' reasoning='The note is properly formatted. It uses a simple sentence and does not require any special formatting. The tags are also properly formatted as a list.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' reasoning='The improved note is not properly formatted. It does not use the hybrid markdown format. The front and back cards are just one sentence and do not include code or math equations, so they do not require special formatting, but they should be in a proper markdown format. For instance, they should be in a single line, and the text should be separated by a space or a period. The improved note is missing this.' is_correct=False
#######################
Gr

The LLM judge often seems to confuse the original and improved notes. Let's try refactoring the prompt (thank you, Claude).

In [11]:
SYSTEM_MSG_V6 = r"""You will be presented with two versions of an Anki note: the original and an improved version. Your task is to evaluate the formatting of the improved note, ensuring it maintains the original's factual content while potentially being more concise or having a slightly different format.

## Input Format

The input will be structured as follows:

```
Original note:
Front: <original front content>
Back: <original back content>
Tags: <original tags>

Improved note:
Front: <improved front content>
Back: <improved back content>
Tags: <improved tags>
```

## Evaluation Criteria

Assess the improved note's formatting for both front and back sides. A properly formatted note should:

1. Utilize hybrid markdown format:
   - Use `<br>` for line breaks when necessary
   - Use `&lt;` for `<` symbol, `&gt;` for `>`, etc.
   - Simple cards with a single sentence and no code/math may not require special formatting

2. Preserve all images and media from the original note

3. Format code correctly:
   - Multi-line code: Use code blocks with language specification
     ```<language>
     <code>
     ```
   - Single-line commands: Use inline code blocks, e.g., `command`

4. Format mathematical equations properly:
   - Wrap in LaTeX blocks: $ <equation> $
   - Avoid double backslashes (\\) within LaTeX blocks

5. Maintain the original note's intent and key information

## Output Format

Provide your evaluation as follows:

1. A concise reasoning for your assessment, highlighting any formatting issues or improvements
2. A boolean verdict: 
   - `True` if the improved note is correctly formatted
   - `False` if the improved note has formatting issues

Example output:
```
Reasoning: The improved note correctly uses hybrid markdown, preserves images, and properly formats code and equations. Line breaks are appropriately handled with <br> tags.

Verdict: True
```

Remember, the original note serves as a reference to ensure the improved version preserves the intended content and any necessary media or code examples.
"""

judge_json = LLMJudgeJSON(
    chat=chat,
    system_msg=SYSTEM_MSG_V6,
    user_msg_tmpl=user_msg_tmpl,
    review_model=Review2,
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' reasoning="The improved note correctly uses hybrid markdown, preserves images, and properly formats code. However, the improved note could be more concise by removing the unnecessary line break and using a more specific language specification for the code block. The original note's intent and key information are maintained." is_correct=True
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' reasoning='The original note is a simple definition, and the improved note maintains the same content and tags. However, it does not utilize the hybrid markdown format as it does not require any special formatting. The note is concise and effectively conveys the definition of an integer.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' reasoning='The improved note is a simple sentence and does not require special formatting. It uses the correct markdown for the front side of the note. However, th

Let's see if Meta AI can help here.

In [12]:
SYSTEM_MSG_V7 = r"""# Anki Note Formatting Evaluation
Provide two Anki notes:

Original Note
Front: <original front>
Back: <original back>
Tags: <original tags>

Improved Note
Front: <improved front>
Back: <improved back>
Tags: <improved tags>

# Evaluation Criteria
Assess the formatting of the improved note, ensuring it:
1. Uses hybrid markdown format (e.g., <br> for new lines, &lt; for < symbol)
2. Preserves images and media from the original note
3. Formats code using:
  * Code blocks (<language><br><command><br>) for multiple lines
  * Inline code blocks () for single-line commands (e.g., iw, d, :copen)
4. Formats mathematical equations using LaTeX blocks ($ <math equation> $) without double backslashes (\\)

# Requirements
* The improved note should be factually equivalent to the original note.
* Conciseness and formatting improvements are expected.

# Response Format
Provide:
* A concise reasoning for your evaluation
* A boolean answer: True (improved note is properly formatted) or False (improved note is not properly formatted)
"""

judge_json = LLMJudgeJSON(
    chat=chat,
    system_msg=SYSTEM_MSG_V7,
    user_msg_tmpl=user_msg_tmpl,
    review_model=Review2,
)
review_notes(ra=ra, judge=judge_json)

Ground Truth: True
Verdict: guid='"bv5;TaZ#F2"' reasoning='The improved note uses hybrid markdown format, preserves images and media, and formats code using code blocks. However, it does not format the inline code block (git submodule add) correctly. It should be formatted as (git submodule add).' is_correct=False
#######################
Ground Truth: True
Verdict: guid='DkiJ0e50/*' reasoning='The original note is a simple definition. To improve it, we need to format it properly using markdown and preserve the image and tags.' is_correct=True
#######################
Ground Truth: True
Verdict: guid='rvo9[&8:`q' reasoning='The improved note should be factually equivalent to the original note, and the provided original note is a simple definition. To improve the formatting, we need to consider the evaluation criteria. However, the original note does not contain any code, images, or mathematical equations, so the improved note should be identical to the original note.' is_correct=True
###

### Create helper functions to facilitate reviewing notes

Let's create a `pandas.DataFrame` that combines the original note, the edited note, and the LLM review. This will make it easier to inspect the LLM reviews.

In [13]:
df_eval = pd.read_csv("../data/eval.txt", sep="\t", header=None)
df_eval.columns = ["guid", "score"]
df_eval.head()

|   | guid | score |
|---|------|-------|
| 0 | bv5;TaZ#F2 | True |
| 1 | DkiJ0e50/* | True |
| 2 | rvo9[&8:`q | True |
| 3 | I@*6RLEsm] | False |
| 4 | v5I1<L+^4k | True |


In [14]:
# The reviews mapping stores guid -> score strings (see review_notes above),
# so build the frame from its items rather than calling .dict() on each key
results = ra._ReviewApp__reviews
df_scores = pd.DataFrame(list(results.items()), columns=["guid", "score"])
df_scores.head()

In [None]:
a = [note.dict() for note in deck]
df_notes = pd.DataFrame(a)
df_notes.head()

In [None]:
x = pd.merge(df_notes, df_scores, how="inner", on="guid")
x = x[x.tags.apply(lambda a: "life" not in a)]  # exclude personal notes
print(x.shape)
x.head(25)

In [None]:
def validate_interactive_session(session_text):
    lines = session_text.strip().split("<br>")
    input_pattern = r"^>>> .*$"
    continuation_pattern = r"^\.\.\. .*$"  # escape the dots so only a literal "..." matches
    output_pattern = r"^(?!>>>)(?!\.\.\.)"

    state = "expecting_input"
    for i, line in enumerate(lines, 1):
        if state == "expecting_input":
            if not (
                re.match(input_pattern, line) or re.match(continuation_pattern, line)
            ):
                return False, f"Line {i}: Expected input (>>> or ...), got: {line}"
            state = "optional_output"
        elif state == "optional_output":
            if re.match(input_pattern, line) or re.match(continuation_pattern, line):
                # This line is itself an input, so output may follow it again
                state = "optional_output"
            elif not re.match(output_pattern, line):
                return False, f"Line {i}: Invalid output format: {line}"

    return True, "Valid interactive session format"


def validate_code_block_format(block):
    # Check if the block starts and ends with ```
    if not (block.startswith("```") and block.endswith("```")):
        return False, "Code block should start and end with ```"

    # Remove the opening and closing ```
    content = block[3:-3].strip()

    # Check if the block starts with a language specifier
    if not re.match(r"^[\w-]+<br>", content):
        return (
            False,
            "Code block should start with a language specifier followed by <br>",
        )

    # Split the content by <br> tags
    lines = content.split("<br>")

    # Check if the last line is empty (as it should end with <br>)
    if lines[-1].strip() != "":
        return False, "Code block should end with <br>"

    # Check if there are any empty lines in between (which would indicate missing <br>)
    if any(line.strip() == "" for line in lines[1:-1]):
        return (
            False,
            "Code block should not have empty lines. Use <br> for line breaks.",
        )

    return True, "Valid code block format"


def validate_hybrid_markdown(content):
    issues = []

    # Check for double backslashes in LaTeX blocks
    latex_blocks = re.findall(r"\$(.*?)\$", content, re.DOTALL)
    for block in latex_blocks:
        if "\\\\" in block:
            issues.append(
                "Double backslash (\\\\) found in LaTeX block. This may cause rendering issues."
            )

    # Check for unmatched dollar signs
    # Split the content into code blocks and non-code blocks
    parts = re.split(r"(```[\s\S]*?```)", content)

    total_dollar_count = 0
    for part in parts:
        if part.startswith("```") and part.endswith("```"):
            # This is a code block
            is_valid, message = validate_code_block_format(part)
            if not is_valid:
                issues.append(f"Invalid code block format: {message}")

            if part.startswith("```python"):
                # Check if it's an interactive Python session
                session_content = part[13:-3].strip()  # Remove ```python<br> and ```
                is_valid, message = validate_interactive_session(session_content)
                if not is_valid:
                    issues.append(
                        f"Invalid Python interactive session in code block: {message}"
                    )
        else:
            # Count dollar signs in non-code block parts
            dollar_count = part.count("$")
            total_dollar_count += dollar_count

    # Check if the total number of dollar signs outside code blocks is odd
    if total_dollar_count % 2 != 0:
        issues.append(
            "Unmatched dollar signs outside code blocks. LaTeX may not render correctly."
        )

    # Check for common Markdown syntax errors
    if "```" in content and content.count("```") % 2 != 0:
        issues.append(
            "Unmatched code block delimiters (```). Code blocks may not render correctly."
        )

    return issues

In [None]:
n_reviews = 100

for _, note in x.iloc[:n_reviews].iterrows():
    print(f"Front: {note['front']}\nBack: {note['back']}\nTags: {note['tags']}")
    for side in ["front", "back"]:
        issues = validate_hybrid_markdown(note[side])
        if issues:
            for issue in issues:
                print(f"Issue {side}: {issue}")
        else:
            print(f"Issue {side}: None")
    print("\n")

Common errors are:

* Missing `<img>`
* Wrong prompt (e.g., `>>`, missing `$`)
* Missing `<br>` inside code block
* Missing `<br>` outside code block
* `\\` in LaTeX
* References (should we remove them?)
* Trailing `.` (full stop)
* Using code block for note that does not contain code
* "```bash" for keymap
* Missing language in code block
* Unmatched code block delimiter (missing trailing "```")
* Missing inline code block for keymap or short commands
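Some of these errors can be caught with plain rules before any LLM is involved. As an illustration, here is a minimal sketch of a check for the first item, a missing `<img>`, assuming images appear as HTML `<img ... src="...">` tags in the note text. `missing_images` is a hypothetical helper and is not part of the validators above.

```python
import re
from typing import List

# Capture the src attribute of each <img> tag (single or double quotes)
IMG_SRC_RE = re.compile(r"<img\b[^>]*\bsrc\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)


def missing_images(original: str, edited: str) -> List[str]:
    """Return the image sources present in the original text but absent from the edited text."""
    edited_srcs = set(IMG_SRC_RE.findall(edited))
    return [src for src in IMG_SRC_RE.findall(original) if src not in edited_srcs]


original_back = 'Diagram: <img src="a.png"> and <img src="b.png">'
edited_back = 'Diagram: <img src="a.png">'
print(missing_images(original_back, edited_back))
```

A check like this needs both versions of the note, which is one more reason to give the judge access to the original.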

### Todo

- [ ] Create a dataset to measure LLM judge's alignment with human preference 
- [ ] Use _reflection_ agentic workflow to improve notes