
# OpenProBono RAG Evaluation, Part 1
## Synthetic Data

### About these notebooks

These notebooks are based on [RAG Evaluation](https://huggingface.co/learn/cookbook/en/rag_evaluation) by [Aymeric Roucher](https://huggingface.co/m-ric). They have been split into 4 parts that build on top of each other.

Any section inside block quotes is a direct quote. The general structure is copied and ideas are paraphrased. Prompts are adjusted to fit OpenProBono's use case. Code has been added and modified.

### Introduction

These notebooks demonstrate how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using an LLM-as-a-judge to compute the accuracy of your system.

>For an introduction to RAG, you can check [this other cookbook](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain)!
>
>RAG systems are complex: here a RAG diagram, where we noted in blue all possibilities for system enhancement:
>
><img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" alt="RAG workflow: Knowledge Base, Embedding Model, LLM, LLM Prompt, and more" height="700"/>
>
>Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system’s performance! So let’s see how to evaluate our RAG system.
>

### Evaluating RAG performance

>Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.
>
>For our evaluation pipeline, we will need:
>
>1. An evaluation dataset with question - answer couples (QA couples)
>2. An evaluator to compute the accuracy of our system on the above evaluation dataset.
>
>➡️ It turns out, we can use LLMs to help us all along the way!
>
>1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖
>2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.
>
>**Let’s dig into it and start building our evaluation pipeline!**

### 0: Install and import dependencies

In [None]:
%pip install -q tqdm openai pandas langchain unstructured

In [None]:
from pathlib import Path

import pandas as pd
from tqdm.auto import tqdm

pd.set_option("display.max_colwidth", None)

### 1: Build a synthetic dataset for evaluation

>We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.
>
>Then we setup other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw.

#### 1.1: Load sources
 
We have a list of sources, which can be files or URLs. A generator function yields them one at a time, so we generate questions about one source before moving on to the next.

For this example our knowledge base is the NC General Statutes. We can access them by chapter or by statute through a data loading class we wrote called [KnowledgeBaseNC](knowledge_bases.py:117).

More information on `KnowledgeBaseNC` and its base class [KnowledgeBase](knowledge_bases.py:16) is in Part 2. The only function we need from `KnowledgeBaseNC` in this notebook is [generate_elements()](knowledge_bases.py:131).

In [None]:
from knowledge_bases import KnowledgeBaseNC

eval_data = KnowledgeBaseNC()
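
To illustrate the access pattern, here is a minimal sketch of a generator that yields one `(source, elements)` pair at a time, which is the shape in which `generate_elements()` is consumed later in this notebook. The names `FakeElement`, `generate_elements_sketch`, and `toy_sources` are hypothetical stand-ins. The real `KnowledgeBaseNC` loader is covered in Part 2 and may work differently.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an `unstructured` Element (only `.text` is modeled)."""

    text: str


def generate_elements_sketch(
    sources: dict[str, list[str]],
) -> Iterator[tuple[str, list[FakeElement]]]:
    """Lazily yield (source, elements) pairs, one source at a time."""
    for source, paragraphs in sources.items():
        # A real loader would fetch and partition the file or URL here.
        yield source, [FakeElement(text=p) for p in paragraphs]


# Toy data, invented for this example.
toy_sources = {
    "chapter_1": ["Section 1-1 text.", "Section 1-2 text."],
    "chapter_2": ["Section 2-1 text."],
}
pairs = list(generate_elements_sketch(toy_sources))
```

Because the generator is lazy, a consumer that stops early (for example, once enough questions exist) never loads the remaining sources.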

We prepare the sources for question generation by *chunking* them. We will do this again in Part 2 using different chunking strategies when we store embedded chunks in a vector database for RAG.

In [None]:
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import Element


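def chunk_elements_qa(elements: list[Element]) -> list[Element]:
    """Group one source's elements into large chunks for question generation."""
    return chunk_by_title(
        elements,
        max_characters=10000,  # hard cap on the size of a chunk
        combine_text_under_n_chars=5000,  # merge small sections until they reach this size
        new_after_n_chars=2500,  # soft cap: close the chunk once it reaches this size
        overlap=2000,  # characters carried over when an oversized element is split
    )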
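
For intuition about what overlap does, here is a deliberately simplified character-window chunker. It is not the algorithm `chunk_by_title` uses, which works on elements and section boundaries. The function name and the toy values are assumptions for illustration only.

```python
def chunk_text_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of at most max_characters that share `overlap` characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")

    step = max_characters - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + max_characters])
        if start + max_characters >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


toy_chunks = chunk_text_with_overlap("abcdefghij", max_characters=4, overlap=1)
# toy_chunks == ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.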

#### 1.2: Set up question generation agents

Every agent in the eval dataset generation and RAG processes calls an LLM, so let's import the functions we need from our `chat_models` module. We'll use `gpt-4o` as the model.

In [None]:
from chat_models import chat, messages
from models import ChatModelParams

cm_params = ChatModelParams(engine="openai", model="gpt-4o")

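Here are the prompts for question generation. Only `QA_generation_prompt` is used by the code in this notebook; `QA_codify_prompt` is a variant for questions about where a law is codified.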

In [None]:
QA_generation_prompt = """Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

QA_codify_prompt = """Your task is to come up with a question and answer about where a particular law is codified given a context.
Your question should be answerable with a specific statute or section from the context.
Your question should be formulated in the same style as questions users could ask a legal search engine.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Question: (your question)
Answer: (your answer to the question)

Now here is the context.

Context: {context}\n
Output:::"""

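For cost and time considerations, we generate a number of questions roughly proportional to the length of each source: about one question per `chars_per_page` characters of eligible chunk text, capped by `max_num_questions` and by the number of eligible chunks.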

In [None]:
import random


def generate_qa_couples(
    chunks: list[Element],
    max_num_questions: int,
    min_num_chars: int,
    chars_per_page: int,
    max_len_answer: int,
) -> list[dict]:
    """Generate up to `max_num_questions` QA couples from the chunks of one source."""
    # filter out chunks with length < min_num_chars
    chunks = [chunk for chunk in chunks if len(chunk.text) >= min_num_chars]
    char_count = sum(len(chunk.text) for chunk in chunks)
    if char_count < min_num_chars:
        return []

    # about one question per `chars_per_page` characters, capped by the budget and the chunk count
    question_count = min(max_num_questions, len(chunks), max(1, char_count // chars_per_page))
    print(f"Generating {question_count} QA couples...")

    couples = []
    for sampled_context in tqdm(random.sample(chunks, question_count)):
        # Generate QA couple
        response = chat(
            messages(
                [[QA_generation_prompt.format(context=sampled_context.text), None]],
                cm_params.engine,
            ),
            cm_params,
            temperature=0.7,
        )
        output_qa_couple = response.choices[0].message.content or ""
        # skip replies that do not follow the requested output format
        if "Factoid question: " not in output_qa_couple or "Answer: " not in output_qa_couple:
            print("Skipping malformed output")
            continue
        question = output_qa_couple.split("Factoid question: ")[-1].split("Answer: ")[0].rstrip()
        answer = output_qa_couple.split("Answer: ")[-1].rstrip()
        if len(answer) >= max_len_answer:
            print("Answer is too long")
            continue
        couples.append(
            {
                "context": sampled_context.text,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata.url,
            },
        )
    return couples
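
To make the question budget concrete, the sketch below isolates the `question_count` computation from `generate_qa_couples` so it can be tried on plain numbers. The helper name `question_budget` and the chunk lengths are made up for this example.

```python
def question_budget(
    chunk_lengths: list[int],
    max_num_questions: int,
    min_num_chars: int,
    chars_per_page: int,
) -> int:
    """Return how many questions would be requested for one source."""
    # Only chunks that are long enough are eligible.
    eligible = [n for n in chunk_lengths if n >= min_num_chars]
    char_count = sum(eligible)
    if char_count < min_num_chars:
        return 0
    # One question per "page", but never more than the cap or the number of chunks.
    return min(max_num_questions, len(eligible), max(1, char_count // chars_per_page))


# Two eligible chunks (3000 and 6000 chars) give 9000 // 2500 = 3 pages, capped at 2.
two_chunks = question_budget([3000, 1200, 6000], 2, 2500, 2500)
# A single eligible chunk can yield at most one question.
one_chunk = question_budget([2600], 2, 2500, 2500)
# No chunk reaches the minimum length, so nothing is generated.
no_chunks = question_budget([1000, 2000], 2, 2500, 2500)
```

With these inputs the three results are 2, 1, and 0 respectively.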

In [None]:
def generate_n_qa_couples(
    n: int,
    max_questions_per_chunk: int = 2,
    min_chars_per_chunk: int = 2500,
    chars_per_page: int = 2500,
    max_len_answer: int = 1000,
) -> list[dict]:
    """Collect at most `n` QA couples, visiting the sources one at a time.

    Note that `max_questions_per_chunk` caps the questions generated per source.
    """
    tot_couples = []
    for src, elements in eval_data.generate_elements():
        chunks = chunk_elements_qa(elements)
        couples = generate_qa_couples(
            chunks,
            # never request more couples than are still needed to reach n
            min(max_questions_per_chunk, n - len(tot_couples)),
            min_chars_per_chunk,
            chars_per_page,
            max_len_answer,
        )
        tot_couples += couples
        # fewer than n couples are returned if the sources run out first
        if len(tot_couples) >= n:
            break
    return tot_couples
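
The generator's reply has to be split into a question and an answer. As an alternative to chained `split` calls, a regular expression can extract both fields and report a missing label explicitly. This is a minimal sketch: the names `QA_PATTERN` and `parse_qa_output` are hypothetical, and the sample replies are invented.

```python
import re
from typing import Optional

# Expects the "Factoid question: ... Answer: ..." layout requested in QA_generation_prompt.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_output(raw: str, max_len_answer: int = 1000) -> Optional[tuple[str, str]]:
    """Return (question, answer), or None if the reply is malformed or too long."""
    match = QA_PATTERN.search(raw)
    if match is None:
        return None  # one of the labels is missing
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer or len(answer) >= max_len_answer:
        return None
    return question, answer


good = parse_qa_output(
    "Output:::\nFactoid question: What is the filing deadline?\nAnswer: 30 days after service."
)
bad = parse_qa_output("I cannot write a question for this context.")
```

Returning `None` for a reply without labels keeps free-form text from being stored as both the question and the answer.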

In [None]:
NUM_QUESTIONS = 30
couples = generate_n_qa_couples(NUM_QUESTIONS)
display(pd.DataFrame(couples).head(NUM_QUESTIONS))

#### 1.3: Set up question critique agents

The generated questions can be flawed in many ways.

>We use an agent to determine if a generated question meets the following criteria, given in [this paper](https://huggingface.co/papers/2312.10003):

- **Groundedness**: can the question be answered from the given context?
- **Relevance**: is the question relevant to users? For instance, *"What are some of Thomas Jefferson's beliefs regarding the rights and liberties of individuals?"* is not relevant for OpenProBono users.

>One last failure case we’ve noticed is when a function is tailored for the particular setting where the question was generated, but undecipherable by itself, like *"What is the name of the function used in this guide?"*. We also build a critique agent for this criteria:

- **Standalone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? For instance, *"What does the term 'legal entity' refer to in this statute?"* is tailored for a particular statute, but unclear by itself.

>We systematically score functions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.
>
>💡 ***When asking the agents to output a score, we first ask them to produce its rationale. This will help us verify scores, but most importantly, asking it to first output rationale gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.***
>
>We now build and run these critique agents.

In [None]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to assess a question answering system.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'according to this Article', the rating must be 1.
The questions can contain obscure legal definitions or entities like trier of fact or the North Carolina Self-Insurance Security Fund and still be a 5: it must simply be clear to an operator with access to legal documents what the question is about.

For instance, the context "On July 1 of each year, a maximum weekly benefit amount shall be computed." and question "When is the maximum weekly benefit amount computed and adjusted?" should receive a 1, since the question implicitly mentions the maximum weekly benefit amount, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the context and question.

Context: {context}\n
Question: {question}\n
Answer::: """

In [None]:
critique_cm = ChatModelParams(engine="openai", model="gpt-4o")


def generate_qa_critiques(couples: list[dict]) -> None:
    """Add a score and a rationale per criterion to each QA couple, in place."""
    print("Generating critique for each QA couple...")
    for output in tqdm(couples):
        prompts = {
            "groundedness": question_groundedness_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
            "relevance": question_relevance_critique_prompt.format(question=output["question"]),
            "standalone": question_standalone_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
        }
        for criterion, prompt in prompts.items():
            evaluation = chat(
                messages([[prompt, None]], critique_cm.engine),
                critique_cm,
            ).choices[0].message.content
            try:
                # the rating follows the last "Total rating: " label; the rationale precedes it
                score = int(evaluation.split("Total rating: ")[-1].strip())
                feedback = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1].strip()
            except (AttributeError, IndexError, ValueError) as e:
                # leave this criterion unset for this couple and move on to the next one
                print(f"Could not parse {criterion} critique: {e}")
                continue
            output.update({f"{criterion}_score": score, f"{criterion}_eval": feedback})
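
A model may return a rating with trailing punctuation or extra text, such as `4.` or `2/5`, which a bare `int()` conversion rejects. Here is one possible, more tolerant parser for the `Evaluation:` / `Total rating:` layout requested in the critique prompts. The names `RATING_PATTERN`, `EVAL_PATTERN`, and `parse_critique` are hypothetical, and the sample replies are invented.

```python
import re
from typing import Optional

# A single digit from 1 to 5 right after the label; \b rejects values like "10".
RATING_PATTERN = re.compile(r"Total rating:\s*([1-5])\b")
EVAL_PATTERN = re.compile(r"Evaluation:\s*(.*?)\s*Total rating:", re.DOTALL)


def parse_critique(raw: str) -> Optional[tuple[int, str]]:
    """Return (score, rationale), or None if no rating from 1 to 5 is found."""
    rating = RATING_PATTERN.search(raw)
    if rating is None:
        return None
    evaluation = EVAL_PATTERN.search(raw)
    feedback = evaluation.group(1) if evaluation else ""
    return int(rating.group(1)), feedback


clean = parse_critique("Evaluation: The context states the date explicitly.\nTotal rating: 5")
noisy = parse_critique("Evaluation: Vague.\nTotal rating: 2/5.")
missing = parse_critique("The question looks fine to me.")
```

Whether to accept noisy ratings or to discard them is a design choice: accepting them keeps more couples, while discarding them keeps the dataset limited to replies that followed the format exactly.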

>Now let us filter out bad questions based on our critique agent scores:

In [None]:
def filter_questions(
    generated_questions: pd.DataFrame,
    min_groundedness: int,
    min_relevance: int,
    min_standalone: int,
):
    return generated_questions.loc[
        (generated_questions["groundedness_score"] >= min_groundedness)
        & (generated_questions["relevance_score"] >= min_relevance)
        & (generated_questions["standalone_score"] >= min_standalone)
    ]
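
The same filtering rule can be written over the raw list of dictionaries, which makes the treatment of missing scores explicit. In the pandas version a missing score shows up as `NaN`, and comparisons with `NaN` evaluate to `False`, so such rows are dropped as well. The helper `passes_filters` and the toy couples below are assumptions made for this sketch.

```python
def passes_filters(couple: dict, thresholds: dict[str, int]) -> bool:
    """Keep a couple only if every criterion has a score at or above its minimum."""
    for criterion, minimum in thresholds.items():
        score = couple.get(f"{criterion}_score")
        # A missing score means the critique could not be parsed: reject the couple.
        if score is None or score < minimum:
            return False
    return True


thresholds = {"groundedness": 4, "relevance": 4, "standalone": 4}
toy_couples = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 4, "standalone_score": 4},
    {"question": "q2", "groundedness_score": 5, "relevance_score": 3, "standalone_score": 5},
    {"question": "q3", "groundedness_score": 5, "relevance_score": 5},  # standalone missing
]
kept = [c["question"] for c in toy_couples if passes_filters(c, thresholds)]
```

Only `q1` survives: `q2` fails the relevance threshold and `q3` has no standalone score.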

In [None]:
generate_qa_critiques(couples)
couples_df = pd.DataFrame.from_dict(couples)
display(couples_df.head(NUM_QUESTIONS))

In [None]:
couples_df_filtered = filter_questions(couples_df, 4, 4, 4)
display(couples_df_filtered.head(NUM_QUESTIONS))

>Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.

Save the dataset to a file:

In [None]:
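out_path = Path("data/NC-court/court_dataset.json")
out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
with out_path.open("w") as f:
    f.write(couples_df_filtered.to_json())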
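
Called with default arguments, `DataFrame.to_json()` writes a column-oriented object of the form `{column: {row_label: value}}`. If a later step wants one record per question without going through pandas, the file can be reshaped with the standard library. This is a sketch: the function name and the sample JSON string are invented for illustration.

```python
import json


def columns_json_to_records(raw_json: str) -> list[dict]:
    """Turn {column: {row_label: value}} JSON into a list of row dictionaries."""
    columns = json.loads(raw_json)
    # Collect row labels in first-seen order; filtering leaves gaps such as "0", "3".
    row_labels = list(dict.fromkeys(label for values in columns.values() for label in values))
    return [
        {column: values.get(label) for column, values in columns.items()}
        for label in row_labels
    ]


sample = '{"question": {"0": "Q1", "3": "Q2"}, "answer": {"0": "A1", "3": "A2"}}'
records = columns_json_to_records(sample)
```

Here `records` holds two dictionaries, one per surviving question, each with a `question` and an `answer` key.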