
# OpenProBono RAG Evaluation

This notebook was created using https://huggingface.co/learn/cookbook/en/rag_evaluation as a guide. Text inside double quotation marks is quoted directly from that source, and much of the remaining text is paraphrased from it. Prompts and code have been modified for use by OpenProBono.

### 0: Install and import dependencies.

In [20]:
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets

In [21]:
import json
from pathlib import Path
from typing import List, Optional, Tuple

import openai
import pandas as pd
from tqdm.auto import tqdm

import milvusdb

pd.set_option("display.max_colwidth", None)

### 1: Load our knowledge base

For this step, we have already loaded the data we wish to evaluate into a `Collection` in our Milvus vector database.

#### 1.1: Load sources
 
We have a list of sources we use to filter the documents so we can generate questions about one document at a time. The sources can be files or URLs. We load the list of sources for this example from a file named `urls`.

In [22]:
with Path("urls").open() as f:
    # One source per line; skip blank lines so they do not become empty sources
    urls = [line.strip() for line in f if line.strip()]

#### 1.2: Load documents

Now that we have loaded our source URLs, we can write a function that retrieves the chunks associated with a given source from Milvus. We use a boolean expression filter to select the right chunks.

In [23]:
collection_name = milvusdb.COURTROOM5

def load_url_documents(url: str):
    expr = f"metadata['url']=='{url}'"
    hits = milvusdb.get_expr(collection_name, expr)["result"]
    for i in range(len(hits)):
        hits[i]["url"] = hits[i]["metadata"]["url"]
        hits[i]["page_number"] = hits[i]["metadata"]["page_number"]
        del hits[i]["pk"]
        del hits[i]["metadata"]
    return hits
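
The filter expression above interpolates the URL directly into a string literal, so a URL containing a single quote would produce a malformed expression. Below is a minimal sketch of a helper that escapes the URL first. The name `build_url_filter` is hypothetical, and the sketch assumes Milvus accepts backslash-escaped quotes inside string literals, which should be checked against the Milvus version in use.

```python
def build_url_filter(url: str) -> str:
    """Build a boolean expression matching chunks whose metadata url equals `url`.

    Assumption: the expression parser accepts backslash escapes inside
    single-quoted string literals.
    """
    # Escape backslashes first, then single quotes, so the URL cannot
    # terminate the string literal early.
    escaped = url.replace("\\", "\\\\").replace("'", "\\'")
    return f"metadata['url']=='{escaped}'"


# For a URL without special characters the result is identical to the
# f-string used in load_url_documents.
print(build_url_filter("https://example.com/statute.pdf"))
```

With this helper, `expr = build_url_filter(url)` could replace the inline f-string in `load_url_documents`.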

#### 1.3: Set up question generation agents

We will use `gpt-3.5-turbo` for question generation.

In [24]:
model = "gpt-3.5-turbo"
llm_client = openai.OpenAI()

def call_llm(client: openai.OpenAI, prompt: str):
    prompt_msg = {"role": "system", "content": prompt}
    response = client.chat.completions.create(
        model=model,
        messages=[prompt_msg],
        max_tokens=1000,
    )
    return response.choices[0].message.content

Here is the prompt that will be given to our LLM to generate questions about our documents:

In [25]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

We will generate 3 samples to test.

In [26]:
import random

N_GENERATIONS = 3  # We intentionally generate only 3 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(load_url_documents(urls[4]), N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(llm_client, QA_generation_prompt.format(context=sampled_context["text"]))
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0].rstrip()
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context["text"],
                "question": question,
                "answer": answer,
                "source_doc": sampled_context["url"],
            }
        )
    except Exception:
        # Skip QA couples whose output is malformed or whose answer is too long
        continue

Generating 3 QA couples...


100%|██████████| 3/3 [00:03<00:00,  1.21s/it]
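
The generation loop above extracts the question and answer with chained `split` calls and relies on an `assert` inside a `try` block to reject long answers. A regex-based parser makes the accepted format explicit and returns `None` instead of raising. This is only a sketch: `parse_qa_output` and `QA_PATTERN` are hypothetical names, and the 300-character limit mirrors the assertion in the loop.

```python
import re
from typing import Optional, Tuple

# Matches the "Factoid question: ... Answer: ..." layout requested in the prompt.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_output(raw: Optional[str], max_answer_len: int = 300) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None if the output is malformed or the answer is too long."""
    if not raw:
        return None
    match = QA_PATTERN.search(raw)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    # Reject empty fields and answers at or above the length limit.
    if not question or not answer or len(answer) >= max_answer_len:
        return None
    return question, answer


sample = "Output:::\nFactoid question: What is X?\nAnswer: Y.\n"
print(parse_qa_output(sample))
```

Inside the loop, a `None` result would simply be skipped with `continue`, which avoids using exceptions for control flow.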


In [27]:
display(pd.DataFrame(outputs).head(1))

| | context | question | answer | source_doc |
|---|---|---|---|---|
| 0 | Chapter 1D.\n\nPunitive Damages.\n\n§ 1D-1. Purpose of punitive damages.\n\nPunitive damages may be awarded, in an appropriate case and subject to the provisions of this Chapter, to punish a defendant for egregiously wrongful acts and to deter the defendant and others from committing similar wrongful acts. (1995, c. 514, s. 1.)\n\n§ 1D-5. Definitions.\n\nAs used in this Chapter:\n\n(1)\n\n(2) (3)\n\n(4)\n\n(5)\n\n(6)\n\n(7)\n\n"Claimant" means counterclaimant, including cross-claimant, or third-party plaintiff, seeking recovery of punitive damages. In a claim for relief in which a party seeks recovery of punitive damages related to injury to another person, damage to the property of another person, death of another person, or other harm to another person, "claimant" includes any party seeking recovery of punitive damages. "Compensatory damages" includes nominal damages. "Defendant" means a party, including a counterdefendant, cross-defendant, or third-party defendant, from whom a claimant seeks relief with respect to punitive damages. "Fraud" does not include constructive fraud unless an element of intent is present. "Malice" means a sense of personal ill will toward the claimant that activated or incited the defendant to perform the act or undertake the conduct that resulted in harm to the claimant. "Punitive damages" means extracompensatory damages awarded for the purposes set forth in G.S. 1D-1. "Willful or wanton conduct" means the conscious and intentional disregard of and indifference to the rights and safety of others, which the defendant knows or should know is reasonably likely to result in injury, damage, or other harm. "Willful or wanton conduct" means more than gross negligence. (1995, c. 514, s. 1.) | What is the definition of "malice" in the context of punitive damages? | "Malice" means a sense of personal ill will toward the claimant that activated or incited the defendant to perform the act or undertake the conduct that resulted in harm to the claimant. | https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/ByChapter/Chapter_1D.pdf |


#### 1.4: Set up question critique agents

The generated questions can be flawed in many ways. We use an agent to determine if a generated question meets the following criteria, given in [this paper](https://huggingface.co/papers/2312.10003):

- **Groundedness**: can the question be answered from the given context?
- **Relevance**: is the question relevant to users? For instance, *"What are some of Thomas Jefferson's beliefs regarding the rights and liberties of individuals?"* is not relevant for OpenProBono users.
- **Standalone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? For instance, *"What does the term 'legal entity' refer to in this statute?"* is tailored for a particular statute, but unclear by itself.

"We systematically score functions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ***When asking the agents to output a score, we first ask them to produce its rationale. This will help us verify scores, but most importantly, asking it to first output rationale gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.***"


In [28]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to people learning about the legal system.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain specific legal definitions or entities like trier of fact or the Board of County Commissioners and still be a 5: it must simply be clear to an operator with access to legal documents what the question is about.

For instance, "What is included in compensatory damages in Chapter 1D?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [29]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
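    try:
        for criterion, evaluation in evaluations.items():
            # The score follows "Total rating: "; the rationale sits between the two markers
            score = int(evaluation.split("Total rating: ")[-1].strip())
            rationale = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": rationale,  # avoid shadowing the built-in eval()
                }
            )
    except Exception:
        # A malformed critique leaves the remaining scores unset for this QA couple
        continue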

Generating critique for each QA couple...


100%|██████████| 2/2 [00:05<00:00,  2.83s/it]
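
The critique loop above converts everything after `Total rating: ` with `int(...)`, which fails when the model appends extra text such as `4/5` or a trailing sentence. The sketch below shows a more tolerant parser that reads the leading integer and checks that it lies on the 1 to 5 scale. The names `parse_critique` and `RATING_PATTERN` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Captures the first integer that follows the "Total rating:" marker.
RATING_PATTERN = re.compile(r"Total rating:\s*(\d+)")


def parse_critique(raw: Optional[str]) -> Optional[Tuple[int, str]]:
    """Return (score, rationale), or None if no valid 1-5 rating is found."""
    if not raw:
        return None
    match = RATING_PATTERN.search(raw)
    if match is None:
        return None
    score = int(match.group(1))
    if not 1 <= score <= 5:
        return None
    # The rationale is whatever follows the last "Evaluation:" marker
    # before the rating; it is empty if the model omitted it.
    rationale = raw[: match.start()].split("Evaluation:")[-1].strip()
    return score, rationale


print(parse_critique("Evaluation: The context defines the term.\nTotal rating: 5"))
```

A `None` result can be treated the same way the loop treats an exception: the QA couple is left without a score for that criterion.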


In [30]:
pd.set_option("display.max_colwidth", None)
generated_questions = pd.DataFrame.from_dict(outputs)
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ],
)

| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What is the definition of "malice" in the context of punitive damages? | "Malice" means a sense of personal ill will toward the claimant that activated or incited the defendant to perform the act or undertake the conduct that resulted in harm to the claimant. | 5 | 5 | 5 |
| 1 | What must the trial court include in a written opinion when reviewing the evidence regarding punitive damages? | The trial court must state its reasons for upholding or disturbing the finding or award, address the evidence as it relates to liability for or the amount of punitive damages, and do so with specificity. | 5 | 5 | 5 |


In [31]:
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")

display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ],
)

Final evaluation dataset:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What is the definition of "malice" in the context of punitive damages? | "Malice" means a sense of personal ill will toward the claimant that activated or incited the defendant to perform the act or undertake the conduct that resulted in harm to the claimant. | 5 | 5 | 5 |
| 1 | What must the trial court include in a written opinion when reviewing the evidence regarding punitive damages? | The trial court must state its reasons for upholding or disturbing the finding or award, address the evidence as it relates to liability for or the amount of punitive damages, and do so with specificity. | 5 | 5 | 5 |
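
The filtering rule used above (keep a question only if all three scores are at least 4) can also be written as a plain-Python predicate over the `outputs` dictionaries, which is convenient for unit tests that should not depend on pandas. This is a sketch with hypothetical names (`CRITERIA`, `passes_all_criteria`). A missing score counts as a failure here, which matches the pandas filter, where a comparison against a missing value (NaN) evaluates to False.

```python
from typing import Any, Dict, List

CRITERIA = ("groundedness", "relevance", "standalone")


def passes_all_criteria(row: Dict[str, Any], threshold: int = 4) -> bool:
    """Return True only if every critique score is present and >= threshold."""
    for criterion in CRITERIA:
        score = row.get(f"{criterion}_score")
        # A missing or non-integer score counts as a failure.
        if not isinstance(score, int) or score < threshold:
            return False
    return True


rows: List[Dict[str, Any]] = [
    {"question": "q1", "groundedness_score": 5, "relevance_score": 5, "standalone_score": 5},
    {"question": "q2", "groundedness_score": 5, "relevance_score": 3, "standalone_score": 5},
    {"question": "q3", "groundedness_score": 5, "relevance_score": 5},  # standalone score missing
]
kept = [row["question"] for row in rows if passes_all_criteria(row)]
print(kept)  # ['q1']
```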
