
# OpenProBono RAG Evaluation, Part 3
## Retrieval Augmented Generation (RAG)

This notebook evaluates a RAG system against a dataset of question-answer pairs, using an LLM as a judge to score the accuracy of the generated answers.

See Part 1 to learn to create synthetic question-answer pairs using LLMs, and Part 2 to load sources into a vector database and benchmark retrieval for Q&A.

In [None]:
%pip install -q tqdm pandas pymilvus matplotlib

In [None]:
import json
from pathlib import Path

import pandas as pd
import pymilvus
from tqdm.auto import tqdm

from app.chat_models import chat
from app.milvusdb import create_collection, query
from app.models import ChatModelParams, EncoderParams, OpenAIModelEnum

### 1: Build our RAG System

We already loaded sources into our vector database in Part 2.

Now we can configure an LLM and augment its generations with results from our vector database, improving its ability to answer questions about those sources.

We also write functions to test the LLMs without using RAG so we can observe their baseline performance.

#### 1.1 Reader - LLM 💬

>In this part, the LLM Reader reads the retrieved documents to formulate its answer.

OpenProBono supports LLM Readers backed by models from OpenAI, Anthropic, and others. We currently use mainly OpenAI models.

In [None]:
RAG_PROMPT_TEMPLATE = """Using the information contained in the context,
give a comprehensive answer to the user's question.
Respond only to the question asked. Your response should be concise and relevant to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
{context}
"""

NORAG_PROMPT = """You are a legal expert, tasked with answering any question about law. Give a comprehensive answer to the question.
Respond only to the question asked. Your response should be concise and relevant to the question.
If the answer cannot be deduced, do not give an answer."""

In [None]:
def answer_question(
    question: str,
    chat_model: ChatModelParams,
    context: str | None = None,
) -> str:
    """Answer a question using an LLM."""
    if chat_model.engine == "hive":
        msg_history = [{"role":"user", "content":question}]
        answer, _ = chat(
            msg_history,
            chat_model,
            temperature=0,
            # use the rag prompt to pass any additional context from a QA evaluation set
            system=NORAG_PROMPT if not context else RAG_PROMPT_TEMPLATE.format(
                context=context,
            ),
        )
    elif chat_model.engine == "anthropic":
        msg_history = [{"role":"user", "content":question}]
        response = chat(
            msg_history,
            chat_model,
            temperature=0,
            # use the rag prompt to pass any additional context from a QA evaluation set
            system=NORAG_PROMPT if not context else RAG_PROMPT_TEMPLATE.format(
                context=context,
            ),
        )
        answer = response.content[-1].text
    else:
        msg_history = [
            {
                "role": "system",
                "content": NORAG_PROMPT if not context else RAG_PROMPT_TEMPLATE.format(
                    context=context,
                ),
            },
            {
                "role": "user",
                "content": question,
            },
        ]
        response = chat(
            msg_history,
            chat_model,
            temperature=0,
        )
        answer = response.choices[0].message.content
    return answer


def answer_with_rag(
    question: str,
    chat_model: ChatModelParams,
    collection_name: str | None = None,
    k: int = 4,
) -> tuple[str, list[dict]]:
    """Answer a question using RAG."""
    # Gather documents with retriever
    relevant_docs = query(collection_name, question, k)
    relevant_docs = [doc["entity"]["text"] for doc in relevant_docs["result"]]  # keep only the text

    # Build the final prompt context
    # Separate documents with a blank line so one chunk does not run into the next label
    context = "\n\n".join(f"Document {i!s}:::\n" + doc.rstrip() for i, doc in enumerate(relevant_docs))
    answer = answer_question(question, chat_model, context)
    return answer, relevant_docs
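
To make the context format concrete, here is a minimal sketch of how retrieved chunks might be assembled into a single labeled string. The helper name `build_context` and the sample chunks are hypothetical and exist only for illustration; the sketch assumes a blank line between documents so that chunk boundaries stay visible to the reader model.

```python
def build_context(docs: list[str]) -> str:
    """Join retrieved chunks into one context string, labeling each chunk."""
    return "\n\n".join(
        f"Document {i}:::\n{doc.strip()}" for i, doc in enumerate(docs)
    )


# Placeholder chunks standing in for text returned by the retriever
sample_docs = ["First retrieved chunk.\n", "Second retrieved chunk.  "]
context = build_context(sample_docs)
print(context)
```

Labeling each chunk with its index makes it easier to trace which retrieved document a generated answer drew on when reading the saved results.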

### 2: Benchmarking the RAG System

>The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.
>
>To this end, **we set up a judge agent**. ⚖️🤖
>
>Out of the different RAG evaluation metrics, we choose to focus only on faithfulness since it is the best end-to-end metric of our system's performance.

>💡 *In the evaluation prompt, we give a detailed description of each score on the 1-5 scale, as is done in Prometheus's prompt template: this helps the model ground its score precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples.*
>
>💡 *Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement.*

In [None]:
def run_qa(
    chat_model: ChatModelParams,
    eval_dataset: pd.DataFrame,
    output_file: str,
    collection_name: str | None = None,
    verbose: bool | None = True,
    test_settings: str | None = None,  # To document the test settings used
    use_embedding: bool | None = True,
    synth_data: bool | None = True,
):
    """Run a chat model through the given Q&A dataset and save the results to the given output file."""
    try:  # load previous generations if they exist
        with Path(output_file).open() as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for _, example in tqdm(eval_dataset.iterrows(), total=len(eval_dataset)):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        if use_embedding:
            answer, relevant_docs = answer_with_rag(question, chat_model, collection_name)
        else:
            # change/remove context if necessary
            answer = answer_question(question, chat_model, example["contract"])
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')

        result = {
            "question": question,
            "true_answer": example["answer"],
            "generated_answer": answer,
        }
        if synth_data:
            result["source_doc"] = example["source_doc"]
        if use_embedding and relevant_docs:
            result["retrieved_docs"] = list(relevant_docs)
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with Path(output_file).open("w") as f:
            json.dump(outputs, f)
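
The resume behavior in `run_qa` (load earlier results, skip questions that already have an answer, rewrite the file after every new answer) can be sketched in isolation. This is a simplified illustration, not the notebook's implementation: `load_checkpoint`, `save_checkpoint`, and `run_with_resume` are hypothetical names, and the generated answers are placeholders rather than model output.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> list[dict]:
    """Return previously saved results, or an empty list if none are usable."""
    try:
        with path.open() as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return []


def save_checkpoint(path: Path, outputs: list[dict]) -> None:
    """Rewrite the whole results file after each new answer."""
    with path.open("w") as f:
        json.dump(outputs, f)


def run_with_resume(questions: list[str], path: Path) -> list[str]:
    """Answer only questions missing from the checkpoint; return those newly answered."""
    outputs = load_checkpoint(path)
    done = {o["question"] for o in outputs}  # set lookup instead of rescanning the list
    newly_answered = []
    for q in questions:
        if q in done:
            continue
        # Placeholder standing in for a call to the reader model
        outputs.append({"question": q, "generated_answer": f"placeholder answer to {q}"})
        save_checkpoint(path, outputs)
        done.add(q)
        newly_answered.append(q)
    return newly_answered


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "answers.json"
    first_run = run_with_resume(["q1", "q2"], checkpoint)
    second_run = run_with_resume(["q1", "q2", "q3"], checkpoint)
    saved_questions = [o["question"] for o in load_checkpoint(checkpoint)]
```

Saving after every answer costs some extra disk writes, but an interrupted run loses at most one generation.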

In [None]:
EVALUATION_SYSTEM_MSG = "You are a fair evaluator language model."

EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

In [None]:
eval_temperature = 0


def evaluate_answers(
    answer_path: str,
    evaluator_llm: ChatModelParams,
) -> None:
    """Evaluate generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if Path(answer_path).is_file():  # load previous generations if they exist
        with Path(answer_path).open() as f:
            answers = json.load(f)

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_llm.model}" in experiment:
            continue

        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_sys_msg = {"role": "system", "content": EVALUATION_SYSTEM_MSG}
        eval_msg = {"role": "user", "content": eval_prompt}
        eval_response = chat([eval_sys_msg, eval_msg], evaluator_llm, temperature=eval_temperature)
        eval_result = eval_response.choices[0].message.content
        feedback, score = (item.strip() for item in eval_result.split("[RESULT]"))
        experiment[f"eval_score_{evaluator_llm.model}"] = score
        experiment[f"eval_feedback_{evaluator_llm.model}"] = feedback

        with Path(answer_path).open(mode="w") as f:
            json.dump(answers, f)

>🚀 Let's run the tests and evaluate answers!👇

Read the dataset from the file:

In [None]:
from app.eval.evaluations import legalbench

couples_df = legalbench("consumer_contracts_qa").to_pandas()

In [None]:
from datetime import UTC, datetime

from app.knowledge_bases import KnowledgeBaseNC

eval_data = KnowledgeBaseNC()
gpt_3_5 = ChatModelParams(engine="openai", model=OpenAIModelEnum.gpt_3_5)
gpt_4o = ChatModelParams(engine="openai", model=OpenAIModelEnum.gpt_4_o)
hive_7b = ChatModelParams(engine="hive", model="hive-7b")
hive_70b = ChatModelParams(engine="hive", model="hive-70b")
claude = ChatModelParams(engine="anthropic", model="claude-3-sonnet-20240229")
evaluator_llm = gpt_4o
reader_models = [gpt_3_5, gpt_4o, hive_7b, hive_70b, claude]
use_embedding = False # change to True if you want to enable RAG
filenames = {
    gpt_3_5.model: "gpt_35",
    gpt_4o.model: "gpt_4o",
    hive_7b.model: "hive-7b",
    hive_70b.model: "hive-70b",
    claude.model: "claude-sonnet",
}


def evaluate_model(
    chat_model: ChatModelParams,
    settings_name: str,
    collection_name: str | None = None,
    use_embedding: bool | None = None, # To answer questions with RAG or not
    synth_data: bool | None = None, # To include source_doc in eval result if available
):
    # run evaluations with a configured knowledge base
    output_file_name = f"data/consumer_contractqa/{settings_name}.json"

    print(f"Running evaluation for {settings_name}:")

    print("Generating answers...")
    run_qa(
        chat_model=chat_model,
        eval_dataset=couples_df,
        output_file=output_file_name,
        collection_name=collection_name,
        verbose=True,
        test_settings=settings_name,
        use_embedding=use_embedding,
        synth_data=synth_data,
    )

    print("Running evaluation...")
    evaluate_answers(
        output_file_name,
        evaluator_llm,
    )


if use_embedding:
    vdb_basename = "Eval_" + datetime.now(UTC).strftime("%Y%m%d")
    idx = 1
    for encoder in [EncoderParams(name=OpenAIModelEnum.embed_small, dim=768)]:
        for chunk_hardmax, chunk_softmax, overlap in [(5000, 2000, 500)]:
            print("Loading knowledge base embeddings...")
            collection_name = f"{vdb_basename}_{idx}"
            idx += 1
            if not pymilvus.utility.has_collection(collection_name):
                description = (
                    f"Hardmax = {chunk_hardmax}, "
                    f"Softmax = {chunk_softmax}, "
                    f"Overlap = {overlap}, "
                    f"Encoder = {encoder.name}."
                )
                create_collection(
                    collection_name,
                    encoder,
                    description,
                )
                eval_data.populate_database(
                    collection_name,
                    chunk_hardmax,
                    chunk_softmax,
                    overlap,
                )
            for llm in reader_models:
                settings_name = filenames[llm.model]
                evaluate_model(
                    llm,
                    settings_name,
                    collection_name,
                    use_embedding=True,  # without this, run_qa falls back to the non-RAG path
                )
else:
    for llm in reader_models:
        settings_name = filenames[llm.model]
        evaluate_model(llm, settings_name, None)

Plot the average evaluation scores as an accuracy percentage:

In [None]:
import matplotlib.pyplot as plt

results_gpt35_employ_norag = pd.read_json("data/consumer_contractqa/gpt_35.json", orient='records')["eval_score_gpt-4o"]
results_gpt4o_employ_norag = pd.read_json("data/consumer_contractqa/gpt_4o.json", orient='records')["eval_score_gpt-4o"]
results_hive7b_employ_norag = pd.read_json("data/consumer_contractqa/hive-7b.json", orient='records')["eval_score_gpt-4o"]
results_hive70b_employ_norag = pd.read_json("data/consumer_contractqa/hive-70b.json", orient='records')["eval_score_gpt-4o"]
results_claude_employ_norag = pd.read_json("data/consumer_contractqa/claude-sonnet.json", orient='records')["eval_score_gpt-4o"]


# Define the x-axis labels (reader models being evaluated)
models = ["gpt-3.5-turbo-0125", "gpt-4o", "hive-7b", "hive-70b", "claude-3-sonnet"]
# Define the y-axis values (average scores)
y_values = [
    (results_gpt35_employ_norag.mean() / 5) * 100,
    (results_gpt4o_employ_norag.mean() / 5) * 100,
    (results_hive7b_employ_norag.mean() / 5) * 100,
    (results_hive70b_employ_norag.mean() / 5) * 100,
    (results_claude_employ_norag.mean() / 5) * 100,
]
# Define some colors to make it pretty
colors = ["lightgreen", "green", "lightblue", "blue", "brown"]

# Create the bar graph
plt.bar(models, y_values, color=colors)

# Add values above each bar
for i, value in enumerate(y_values):
    plt.text(i, value, f"{value:.2f}", ha="center")

# Set title and labels
plt.title("Evaluation Accuracy Comparison - GPT-4o Judge")
plt.xlabel("Model")
plt.ylabel("Average Evaluation Accuracy")

# Show the plot
plt.show()
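
The plotting cell divides the mean score by 5, so a model that scores 1 on every question still shows 20%. The sketch below places that conversion next to an alternative that rescales the 1-5 range to 0-100. Both function names and the sample scores are hypothetical and chosen only to illustrate the arithmetic; they are not results from this evaluation.

```python
from statistics import mean


def accuracy_percent(scores: list[int], max_score: int = 5) -> float:
    """Mean score as a percentage of the maximum, as in the plot above."""
    return mean(scores) / max_score * 100


def rescaled_accuracy_percent(
    scores: list[int], min_score: int = 1, max_score: int = 5
) -> float:
    """Mean score mapped so that min_score gives 0% and max_score gives 100%."""
    return (mean(scores) - min_score) / (max_score - min_score) * 100


sample_scores = [5, 4, 3, 5]  # made-up judge scores
plain = accuracy_percent(sample_scores)
rescaled = rescaled_accuracy_percent(sample_scores)
floor = accuracy_percent([1, 1, 1])
print(f"plain: {plain:.2f}, rescaled: {rescaled:.2f}, floor: {floor:.2f}")
```

Either convention works for comparing models as long as the same one is applied to every bar; the rescaled version simply makes the worst possible score read as zero.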