
# OpenProBono RAG Evaluation

This notebook was created using https://huggingface.co/learn/cookbook/en/rag_evaluation as a guide. Anything inside double quotation marks is a direct quote from this source, and much of the remaining text is paraphrased from it. Prompts and code have been modified for use by OpenProBono.

### 0: Install and import dependencies.

In [1]:
%pip install -q tqdm openai pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
from pathlib import Path

import openai
import pandas as pd
from tqdm.auto import tqdm

import chat_models
import encoders
import milvusdb

pd.set_option("display.max_colwidth", None)



### 1: Load our knowledge base

For this step, we have already loaded the data we wish to evaluate into a `Collection` in our Milvus vector database.

In [3]:
collection_name = milvusdb.COURTROOM5

#### 1.1: Load sources
 
We have a list of sources we use to filter the documents so we can generate questions about one source at a time. The sources can be files or URLs. We load the list of sources for this example from a file named `urls`.

In [4]:
with Path("urls").open() as f:
    urls = [line.strip() for line in f.readlines()]

#### 1.2: Load documents

Now that we have loaded our source URLs, we can write a function that gets the chunks associated with a given source from Milvus. We use a boolean expression filter on the `url` metadata field to select the right chunks.

In [5]:
def load_url_documents(url: str):
    expr = f"metadata['url']=='{url}'"
    hits = milvusdb.get_expr(collection_name, expr)["result"]
    for i in range(len(hits)):
        hits[i]["url"] = hits[i]["metadata"]["url"]
        hits[i]["page_number"] = hits[i]["metadata"]["page_number"]
        del hits[i]["pk"]
        del hits[i]["metadata"]
    return hits
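To make the reshaping step concrete, here is a minimal sketch of what the loop does to a single hit, with no Milvus connection involved. The `flatten_hit` helper and the `sample_hit` record are hypothetical; they only mirror the field names used above, and unlike the loop above the helper returns a copy rather than modifying the hit in place.

```python
def flatten_hit(hit: dict) -> dict:
    """Return a copy of a hit with url and page_number promoted to top-level keys."""
    metadata = hit["metadata"]
    # Drop the primary key and the nested metadata, keep everything else
    flat = {key: value for key, value in hit.items() if key not in ("pk", "metadata")}
    flat["url"] = metadata["url"]
    flat["page_number"] = metadata["page_number"]
    return flat


# Hypothetical record shaped like the hits handled above
sample_hit = {
    "pk": 1,
    "text": "Chapter 1E. Eastern Band of Cherokee Indians.",
    "metadata": {"url": "https://example.com/Chapter_1E.pdf", "page_number": 1},
}
flat_hit = flatten_hit(sample_hit)
print(flat_hit)
```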

#### 1.3: Set up question generation agents

We will use `gpt-3.5-turbo` for question generation.

In [26]:
llm_client = openai.OpenAI()

def call_llm(
    client: openai.OpenAI,
    prompt: str,
    model: str = chat_models.GPT_3_5,
    temperature: float = 0.7,
    extra_messages: list | None = None,
):
    # The prompt is sent as the system message; any extra messages follow it
    prompt_msg = {"role": "system", "content": prompt}
    response = client.chat.completions.create(
        model=model,
        messages=[prompt_msg, *(extra_messages or [])],
        max_tokens=1000,
        temperature=temperature,
    )
    return response.choices[0].message.content

Here is the prompt that will be given to our LLM to generate questions about our documents:

In [7]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

We will generate 3 samples to test.

In [8]:
import random

N_GENERATIONS = 3  # We intentionally generate only 3 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(load_url_documents(urls[4]), N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(llm_client, QA_generation_prompt.format(context=sampled_context["text"]))
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0].rstrip()
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context["text"],
                "question": question,
                "answer": answer,
                "source_doc": sampled_context["url"],
            }
        )
    except Exception:  # skip generations that fail parsing or the length check
        continue

Generating 3 QA couples...


100%|██████████| 3/3 [00:03<00:00,  1.30s/it]
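The parsing in the loop above relies on the model following the `Factoid question:` / `Answer:` layout requested in the prompt. As a rough sketch of a stricter variant, the hypothetical `parse_qa_output` helper below rejects generations that are missing either label instead of silently storing the whole reply as the question. The 300-character limit matches the assertion above; the sample string is made up for illustration.

```python
def parse_qa_output(raw: str, max_answer_chars: int = 300) -> tuple[str, str]:
    """Split a generation into (question, answer) using the labels from the prompt."""
    question_label, answer_label = "Factoid question: ", "Answer: "
    if question_label not in raw or answer_label not in raw:
        raise ValueError("missing 'Factoid question: ' or 'Answer: ' label")
    question = raw.split(question_label)[-1].split(answer_label)[0].strip()
    answer = raw.split(answer_label)[-1].strip()
    if len(answer) >= max_answer_chars:
        raise ValueError("answer is too long")
    return question, answer


# Hypothetical model reply in the requested format
sample_output = (
    "Factoid question: What must North Carolina courts give to Cherokee Tribal Court judgments?\n"
    "Answer: Full faith and credit."
)
question, answer = parse_qa_output(sample_output)
print(question)
print(answer)
```

A caller could catch `ValueError` and skip the sample, which keeps the same "drop what cannot be parsed" behavior as the loop above.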


In [9]:
display(pd.DataFrame(outputs).head(3))

Unnamed: 0,context,question,answer,source_doc
0,"(c)\n\nService as a law enforcement officer shall constitute service as (i) a ""criminal justice officer"" as defined in G.S. 17C-2(c) and (ii) a ""law enforcement officer"" for purposes of Article 12E of Chapter 143 of the General Statutes. For purposes of Article 12E of Chapter 143 of the General Statutes, the term ""employer,"" as defined in G.S. 143-166.50, shall be construed to include the Eastern Band of Cherokee Indians with respect to law enforcement officers.\n\n(d)\n\nA law enforcement officer may be enjoined from exercising his authority under color of State law pursuant to Article 13 of Chapter 160A of the General Statutes for the reasons set forth in G.S. 128-16 and pursuant to the provisions of Article 2 of Chapter 128 of the General Statutes.\n\n(e)\n\n(f)\n\nNothing contained in this Chapter or in Article 13 of Chapter 160A of the General\n\nStatutes shall be construed as doing any of the following:\n\n(1)\n\n(2)\n\nLimiting or revoking the authority of the Eastern Band of Cherokee Indians, the Cherokee Police Department, the Tribal Alcohol Law Enforcement Division of the Eastern Band of the Cherokee Indians, the Natural Resources Enforcement Agency of the Eastern Band of the Cherokee Indians, or any law enforcement officers or other persons appointed or employed by those entities, in the exercise of their inherent powers of self-government, or exercise of authority conferred by federal law, regulation, or common law. Modifying, either by way of enlargement or limitation, the jurisdiction of the Cherokee Tribal Courts.",What constitutes service as a law enforcement officer for purposes of Article 12E of Chapter 143 of the General Statutes?,"Service as a law enforcement officer shall constitute service as a ""criminal justice officer"" as defined in G.S. 17C-2(c) and a ""law enforcement officer"" for purposes of Article 12E of Chapter 143 of the General Statutes.",https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/ByChapter/Chapter_1E.pdf
1,"Chapter 1E.\n\nEastern Band of Cherokee Indians.\n\nArticle 1.\n\nFull Faith and Credit.\n\n§ 1E-1. Full faith and credit.\n\nThe courts of this State shall give full faith and credit to a judgment, decree, or order signed by a judicial officer of the Eastern Band of Cherokee Indians and filed in the Cherokee Tribal Courts to the same extent as is given a judgment, decree, or order of another state, subject to the provisions of subsections (b) and (c) of this section; provided that the judgments, decrees, and orders of the courts of this State are given full faith and credit by the Tribal Courts of the Eastern Band of Cherokee Indians.\n\n(a)\n\nJudgments, decrees, and orders specified in subsection (a) of this section shall be given full faith and credit subject to the provisions of G.S. 1C-1705 and G.S. 1C-1708 and shall be considered a foreign judgment for purposes of these statutes.\n\n(b)\n\nAny limited driving privilege signed and issued by a Judge or Justice of the Cherokee Tribal Courts in accordance with the applicable provisions of Chapter 20 of the General Statutes and filed in the Cherokee Tribal Courts Clerk's Office shall be valid and given full faith and credit as specified in subsection (a) of this section. For purposes of this subsection, any reference to the issuing ""judge"" or ""court"" in the applicable provisions of Chapter 20 of the General Statutes shall be construed to mean the appropriate Judge or Justice in the Cherokee Tribal Courts or the appropriate Cherokee Tribal Court. (2001-456, s. 1; 2015-287, s. 1.)","What must the courts of North Carolina give to a judgment, decree, or order signed by a judicial officer of the Eastern Band of Cherokee Indians and filed in the Cherokee Tribal Courts?",Full faith and credit.,https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/ByChapter/Chapter_1E.pdf
2,"the Cherokee Marshals Service,\n\n§ 1E-12. Qualification of law enforcement officers; limitations of authority.\n\nFor purposes of this section, ""law enforcement officer"" means any person appointed or employed as (i) Chief of Police of the Cherokee Police Department, Chief of the Cherokee Marshals Service, Chief of the Tribal Alcohol Law Enforcement Division of the Eastern Band of the Cherokee Indians, or Chief of the Natural Resources Enforcement Agency of the Eastern Band of the Cherokee Indians or (ii) a police officer, auxiliary police officer, marshal, alcohol law enforcement agent, reserve alcohol law enforcement agent, or resources officer with the Cherokee Police Department, the Cherokee Marshals Service, the Tribal Alcohol Law Enforcement Division of the Eastern Band of the Cherokee Indians, or the Natural Resources Enforcement Agency of the Eastern Band of the Cherokee Indians.\n\n(a)\n\nA law enforcement officer shall, prior to the exercise of the officer's authority pursuant to Article 13 of Chapter 160A of the General Statutes, comply with the provisions of Article 1 of Chapter 17C of the General Statutes and any rules or regulations adopted pursuant to the authority of Article 1 of Chapter 17C of the General Statutes. The courts of this State shall have the","What is the definition of a ""law enforcement officer"" within the Cherokee Marshals Service?","Any person appointed or employed as Chief of the Cherokee Marshals Service or a police officer, auxiliary police officer, marshal, alcohol law enforcement agent, reserve alcohol law enforcement agent, or resources officer with the Cherokee Marshals Service.",https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/ByChapter/Chapter_1E.pdf


#### 1.4: Set up question critique agents

The generated questions can be flawed in many ways. We use an agent to determine if a generated question meets the following criteria, given in [this paper](https://huggingface.co/papers/2312.10003):

- **Groundedness**: can the question be answered from the given context?
- **Relevance**: is the question relevant to users? For instance, *"What are some of Thomas Jefferson's beliefs regarding the rights and liberties of individuals?"* is not relevant for OpenProBono users.
- **Standalone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? For instance, *"What does the term 'legal entity' refer to in this statute?"* is tailored for a particular statute, but unclear by itself.

"We systematically score functions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ***When asking the agents to output a score, we first ask them to produce its rationale. This will help us verify scores, but most importantly, asking it to first output rationale gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.***"


In [10]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to people learning about the legal system.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain specific legal definitions or entities like trier of fact or the Board of County Commissioners and still be a 5: it must simply be clear to an operator with access to legal documents what the question is about.

For instance, "Who decides the arbitration location without agreement?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [11]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        for criterion, evaluation in evaluations.items():
            # The rationale precedes "Total rating: " and the score follows it
            score, rationale = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": rationale,
                }
            )
    except Exception:
        # Leave the remaining criteria unscored if a critique cannot be parsed
        continue

Generating critique for each QA couple...


100%|██████████| 3/3 [00:11<00:00,  3.68s/it]
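The critique replies are parsed by splitting on `Total rating: ` and converting what follows to an integer, which fails if the model appends anything after the number (for example `4/5`). The sketch below shows one possible alternative using a regular expression; `parse_critique` is a hypothetical helper, not part of the OpenProBono code, and the sample reply is invented.

```python
import re


def parse_critique(raw: str) -> tuple[int, str]:
    """Extract (score, rationale) from a reply in the 'Evaluation / Total rating' format."""
    match = re.search(r"Evaluation:\s*(.*?)\s*Total rating:\s*([1-5])", raw, re.DOTALL)
    if match is None:
        raise ValueError("critique does not follow the expected format")
    return int(match.group(2)), match.group(1)


# Hypothetical critique reply
sample_critique = "Evaluation: The context states the answer directly.\nTotal rating: 5"
score, rationale = parse_critique(sample_critique)
print(score, rationale)
```

Only the first digit from 1 to 5 after the label is read, so trailing text such as `/5` or a period is ignored.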


In [12]:
pd.set_option("display.max_colwidth", None)
generated_questions = pd.DataFrame.from_dict(outputs)
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ],
)

| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What constitutes service as a law enforcement officer for purposes of Article 12E of Chapter 143 of the General Statutes? | Service as a law enforcement officer shall constitute service as a "criminal justice officer" as defined in G.S. 17C-2(c) and a "law enforcement officer" for purposes of Article 12E of Chapter 143 of the General Statutes. | 2 | 5 | 1 |
| 1 | What must the courts of North Carolina give to a judgment, decree, or order signed by a judicial officer of the Eastern Band of Cherokee Indians and filed in the Cherokee Tribal Courts? | Full faith and credit. | 5 | 4 | 5 |
| 2 | What is the definition of a "law enforcement officer" within the Cherokee Marshals Service? | Any person appointed or employed as Chief of the Cherokee Marshals Service or a police officer, auxiliary police officer, marshal, alcohol law enforcement agent, reserve alcohol law enforcement agent, or resources officer with the Cherokee Marshals Service. | 5 | 5 | 5 |


In [13]:
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")

display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ],
)

Final evaluation dataset:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 1 | What must the courts of North Carolina give to a judgment, decree, or order signed by a judicial officer of the Eastern Band of Cherokee Indians and filed in the Cherokee Tribal Courts? | Full faith and credit. | 5 | 4 | 5 |
| 2 | What is the definition of a "law enforcement officer" within the Cherokee Marshals Service? | Any person appointed or employed as Chief of the Cherokee Marshals Service or a police officer, auxiliary police officer, marshal, alcohol law enforcement agent, reserve alcohol law enforcement agent, or resources officer with the Cherokee Marshals Service. | 5 | 5 | 5 |


### 2: Build our RAG System

As stated previously, our data was already chunked and loaded into Milvus. Now we need an LLM Reader to read retrieved documents and formulate answers to questions.

OpenProBono offers various LLM Readers using models offered by OpenAI, HuggingFace, and more. We will create a simple LLM Reader connected to the OpenAI API for this notebook.

In [18]:
RAG_PROMPT_TEMPLATE = """
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.

Context:
{context}

Question: {question}
"""

In [19]:
def answer_with_rag(
    question: str,
    k: int = 7,
) -> tuple[str, list[dict]]:
    """Answer a question using RAG."""
    # Gather documents with retriever
    relevant_docs = milvusdb.query(collection_name, question, k)
    relevant_docs = [doc.entity.get("text") for doc in relevant_docs["result"]]  # keep only the text

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {i!s}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Generate an answer
    answer = call_llm(llm_client, final_prompt, temperature=0)

    return answer, relevant_docs
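To see what the reader model actually receives, the following sketch reproduces the context-building step on two made-up chunks. `build_context` is a hypothetical helper; it joins the numbered chunks with a newline so that the end of one chunk does not run into the next `Document N:::` header, which is a small departure from the code above.

```python
def build_context(docs: list[str]) -> str:
    """Number each retrieved chunk so the reader model can cite it."""
    numbered = [f"Document {i}:::\n{doc}" for i, doc in enumerate(docs)]
    return "\nExtracted documents:\n" + "\n".join(numbered)


# Hypothetical retrieved chunks
context = build_context(["First chunk.", "Second chunk."])
print(context)
```

The document numbers are what the instruction "Provide the number of the source document when relevant" in the prompt refers to.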

### 3: Benchmarking the RAG System

"The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.

To this end, **we set up a judge agent**. ⚖️🤖

Out of the different RAG evaluation metrics, we choose to focus only on faithfulness since it is the best end-to-end metric of our system's performance.

💡 *In the evaluation prompt, we give a detailed description of each metric on the scale 1-5, as is done in Prometheus's prompt template: this helps the model ground its metric precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples.*

💡 *Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement.*"

In [20]:
def run_rag_tests(
    eval_dataset: pd.DataFrame,
    output_file: str,
    verbose: bool | None = True,
    test_settings: str = "",  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    # Load previous generations if they exist so an interrupted run can resume
    try:
        with Path(output_file).open() as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        outputs = []

    for _, example in tqdm(eval_dataset.iterrows(), total=len(eval_dataset)):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": list(relevant_docs),
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with Path(output_file).open("w") as f:
            json.dump(outputs, f)
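The function above doubles as a checkpoint: results are written after every question, and questions already present in the output file are skipped on the next run. The sketch below isolates that pattern under simplifying assumptions. `load_checkpoint`, `run_with_checkpoint`, and `fake_answer` are hypothetical names, and `fake_answer` stands in for `answer_with_rag` so that no API calls are made.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> list[dict]:
    """Return previously saved results, or an empty list if there are none yet."""
    if not path.is_file():
        return []
    with path.open() as f:
        return json.load(f)


def run_with_checkpoint(questions: list[str], path: Path, answer_fn) -> list[dict]:
    """Answer only the questions that are not already in the checkpoint file."""
    results = load_checkpoint(path)
    done = {result["question"] for result in results}
    for question in questions:
        if question in done:
            continue
        results.append({"question": question, "generated_answer": answer_fn(question)})
        done.add(question)
        # Save after every answer so an interrupted run can resume
        with path.open("w") as f:
            json.dump(results, f)
    return results


calls = []


def fake_answer(question: str) -> str:
    """Hypothetical stand-in for answer_with_rag that records its calls."""
    calls.append(question)
    return f"answer to {question}"


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "results.json"
    first_run = run_with_checkpoint(["q1", "q2"], checkpoint, fake_answer)
    second_run = run_with_checkpoint(["q1", "q2", "q3"], checkpoint, fake_answer)

print(calls)
```

On the second run only `q3` reaches `fake_answer`, which is the behavior that keeps repeated notebook runs from paying for the same generations twice.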

In [23]:
EVALUATION_SYSTEM_MSG = "You are a fair evaluator language model."

EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

In [24]:
evaluator_name = chat_models.GPT_4
eval_temperature = 0


def evaluate_answers(
    answer_path: str,
    evaluator_name: str,
) -> None:
    """Evaluate generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if Path(answer_path).is_file():  # load previous generations if they exist
        with Path(answer_path).open() as f:
            answers = json.load(f)

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_msg = {"role": "user", "content": eval_prompt}
        eval_result = call_llm(llm_client, EVALUATION_SYSTEM_MSG, evaluator_name, eval_temperature, [eval_msg])
        feedback, score = (item.strip() for item in eval_result.split("[RESULT]"))
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with Path(answer_path).open(mode="w") as f:
            json.dump(answers, f)
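The judge is asked to reply in the form `Feedback: ... [RESULT] <score>`, and the code above stores the score as the raw string found after the marker. As a sketch of a slightly more defensive parser, the hypothetical `parse_judge_output` helper below splits on the last `[RESULT]` marker, strips the `Feedback:` label, and converts the score to an integer. The sample reply is invented.

```python
import re


def parse_judge_output(raw: str) -> tuple[str, int]:
    """Split a judge reply into (feedback, score) around the last [RESULT] marker."""
    feedback, marker, tail = raw.rpartition("[RESULT]")
    match = re.search(r"[1-5]", tail)
    if not marker or match is None:
        raise ValueError("judge reply does not contain '[RESULT] <1-5>'")
    feedback = feedback.strip().removeprefix("Feedback:").strip()
    return feedback, int(match.group())


# Hypothetical judge reply
sample_reply = "Feedback: The response matches the reference answer. [RESULT] 5"
feedback, score = parse_judge_output(sample_reply)
print(score, feedback)
```

Raising `ValueError` on a malformed reply lets the caller decide whether to retry the judge call or skip the example.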

"🚀 Let's run the tests and evaluate answers!👇"

In [27]:
# Create the output directory if it does not exist yet
Path("./output").mkdir(parents=True, exist_ok=True)

settings_name = f"chunk-hardmax:{2500}_chunk-softmax:{1000}_embeddings:{encoders.OPENAI_3_SMALL}_dim:{768}_rerank:{False}_reader-model:{chat_models.GPT_3_5}"
output_file_name = f"./output/rag_{settings_name}.json"

print(f"Running evaluation for {settings_name}:")

run_rag_tests(
    eval_dataset=generated_questions,
    output_file=output_file_name,
    verbose=True,
    test_settings=settings_name,
)

print("Running evaluation...")
evaluate_answers(
    output_file_name,
    evaluator_name,
)

Running evaluation for chunk-hardmax:2500_chunk-softmax:1000_embeddings:text-embedding-3-small_dim:768_rerank:False_reader-model:gpt-3.5-turbo-0125:


100%|██████████| 2/2 [00:00<00:00, 5588.68it/s]


Running evaluation...


100%|██████████| 2/2 [00:08<00:00,  4.42s/it]
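Once the judge has scored every answer, the per-question scores can be condensed into a single number. The sketch below averages the 1-5 scores after rescaling them to the 0-1 range. This is an assumption about how one might summarize the results, not a step taken in this notebook; `mean_normalized_score` is a hypothetical helper, and the `eval_score_gpt-4` key and sample records are made up to mirror the `eval_score_{evaluator_name}` entries written by `evaluate_answers`.

```python
def mean_normalized_score(results: list[dict], score_key: str) -> float:
    """Average the judge's 1-5 scores after rescaling them to the 0-1 range."""
    scores = [(int(r[score_key]) - 1) / 4 for r in results if score_key in r]
    if not scores:
        raise ValueError(f"no result contains {score_key!r}")
    return sum(scores) / len(scores)


# Hypothetical records shaped like the entries in the output file
sample_results = [
    {"question": "q1", "eval_score_gpt-4": "5"},
    {"question": "q2", "eval_score_gpt-4": "3"},
]
average = mean_normalized_score(sample_results, "eval_score_gpt-4")
print(average)
```

In practice the records would come from `json.load` on the output file, and the key would be built from the evaluator name used in the run.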
