
# OpenProBono RAG Evaluation, Part 2

This notebook is for evaluating a dataset of question-answer pairs using LLM-as-a-judge to compute the accuracy of your system.

See Part 1 to learn to create synthetic question-answer pairs using LLMs.

In [None]:
%pip install -q tqdm openai pandas langchain unstructured pymilvus

In [None]:
import json
from pathlib import Path

import pandas as pd
import pymilvus
from tqdm.auto import tqdm

from models import ChatModelParams, EncoderParams, OpenAIModelEnum

pd.set_option("display.max_colwidth", None)

### 2: Build our RAG System

We load the data for RAG into a `Collection` in our Milvus vector database.

To accomplish this, we must:

1. extract the content from our sources
2. chunk the extracted content
3. embed the chunks
4. insert them into the database

We already wrote generator functions to extract the content, `generate_statute_elements` and `generate_chapter_elements`, so our next step is chunking.

#### 2.1: Chunk sources into documents

>- In this part, **we split the documents from our knowledge base into smaller chunks**: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.
>- The goal is to build semantically relevant snippets: large enough to support an answer on their own, and small enough to avoid diluting individual ideas.
>
>Many options exist for text splitting:
>
>- split every n words / characters, but this risks cutting paragraphs or even sentences in half
>- split after n words / characters, but only on sentence boundaries
>- **recursive split** tries to preserve even more of the document structure by processing it in a tree-like way, splitting first on the largest units (chapters) and then recursively on smaller units (paragraphs, sentences).
>
>To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
>
>[This space](https://huggingface.co/spaces/m-ric/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.
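
To make the recursive option concrete, here is a minimal sketch in plain Python. It is not the splitter this notebook uses (that is `chunk_by_title`), and the function name `recursive_split`, the separator order, and the sample text are all assumptions made for illustration.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_chars, trying the largest units first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut every max_chars characters.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur, so try the next (smaller) unit.
        return recursive_split(text, max_chars, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            # Greedily merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # A single piece is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is here.\n\nThird."
for chunk in recursive_split(sample, max_chars=40):
    print(repr(chunk))
```

Paragraph breaks are tried first, so the first paragraph stays whole and the two short paragraphs that follow are merged into one chunk because together they still fit within the limit.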

We need a function to chunk the content in a given source into documents. This function should accept a list of items containing extracted text and relevant metadata from the source. In our generator function, `partition` returns a list of `Element` objects. We pass these into a chunking function, `chunk_by_title`.

In [None]:
from unstructured.chunking.title import chunk_by_title

#### 2.2: Embed documents

>The retriever acts like an internal search engine: given the user query, it returns the most relevant documents from your knowledge base.

An embedding model transforms the documents to vectors, and Milvus creates an index over those vectors for fast and accurate retrieval.

We want to evaluate different embedding models, so we represent them with a function that accepts text and returns vectors. In order to easily test different models, the function also accepts an `EncoderParams` object that tells it which embedding model to use. See `encoders.embed_strs` for the definition.
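
The shape of such a function can be sketched as follows. This is not `encoders.embed_strs`: `ToyEncoderParams` and `toy_embed_strs` are hypothetical stand-ins, and the "model" is a simple word-hashing trick rather than a learned embedding. The point is only the interface: texts and a parameter object go in, one fixed-size vector per text comes out.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEncoderParams:
    """Hypothetical stand-in for EncoderParams: a model name and a vector size."""

    name: str
    dim: int


def toy_embed_strs(texts: list[str], encoder: ToyEncoderParams) -> list[list[float]]:
    """Return one unit-length vector per text by hashing words into `dim` buckets."""
    vectors = []
    for text in texts:
        vec = [0.0] * encoder.dim
        for word in text.lower().split():
            # Include the encoder name so that different "models" give different vectors.
            digest = hashlib.sha256(f"{encoder.name}:{word}".encode()).digest()
            bucket = int.from_bytes(digest[:4], "big") % encoder.dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
        vectors.append([x / norm for x in vec])
    return vectors


params = ToyEncoderParams(name="toy-hash", dim=16)
vecs = toy_embed_strs(["minimum wage law", "overtime pay"], params)
print(len(vecs), len(vecs[0]))
```

Because the encoder is passed in as data, swapping models in an experiment only means passing a different parameter object; the calling code stays the same.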

#### 2.3: Create vector database

The document embeddings are stored in a vector database for fast vector searching during RAG.
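
As an illustration of what "vector searching" means, the sketch below ranks documents by cosine similarity with an exhaustive scan. It is an assumption-laden toy: Milvus builds an index so it does not have to compare the query against every vector, and the similarity metric configured for a real collection may differ. The helper names `cosine_similarity` and `top_k` are made up for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in top_k([1.0, 0.0], doc_vecs, k=2):
    print(index, round(score, 3))
```

The query vector must come from the same embedding model as the stored vectors, which is why each collection is tied to a single encoder.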

Below is a function that creates a vector database we can use to store documents, and a function to retrieve the embedded documents. The `name` of the database will be determined by the chosen sources, chunking strategy, and embedding model evaluation parameters.

In [None]:
from milvusdb import create_collection, query

#### 2.4: Insert documents

The last step is to insert the embedded documents into the newly created vector database.

An embedded document is represented by its corresponding `vectors`, `texts`, and `metadatas`. The vector database is given by a Milvus `Collection`.

The `milvusdb.upload_data_json` function embeds and uploads documents. We provide it with `texts`, `metadatas`, and the destination `collection_name`. Each collection is associated with a single `EncoderParams`.

In [None]:
from milvusdb import upload_data_json

#### 2.5: Populate the vector database

So far, the parameters we can control are:

1. Chunking strategy parameters
    - chunk hardmax: the maximum number of characters in a chunk
    - chunk softmax: the preferred maximum number of characters in a chunk (see `new_after_n_chars` in `chunk_by_title` for more)
    - overlap: the number of characters to overlap between consecutive chunks
2. Embedding model

There are also some parameters we can't currently control:

1. Document loader (unstructured)
2. Question generation chunking strategy (LangChain's `RecursiveCharacterTextSplitter`)
3. Base RAG chunking strategy (unstructured's `chunk_by_title`)
4. Vector Index (Zilliz autoindex)

Once we decide on values for the parameters we can control, we can chunk and embed our sources into a vector database. We can then use the database to generate a synthetic dataset of question-answer pairs and/or evaluate agents' responses to a generated (or loaded) dataset of questions.

In [None]:
from evaluations import generate_chapter_elements


def populate_database(collection_name: str, chunk_hardmax: int, chunk_softmax: int, overlap: int):
    """Chunk every source and upload the chunks to the given collection."""
    for src, elements in generate_chapter_elements():
        chunks = chunk_by_title(
            elements,
            max_characters=chunk_hardmax,
            new_after_n_chars=chunk_softmax,
            overlap=overlap,
        )
        texts = [chunk.text for chunk in chunks]
        metadatas = [chunk.metadata.to_dict() for chunk in chunks]
        result = upload_data_json(texts, metadatas, collection_name)
        if result["message"] != "Success":
            # Stop at the first failed upload and report the source that caused it
            print(f"Upload failed for {src}: {result['message']}")
            break

#### 2.6: Reader - LLM 💬

>In this part, the LLM Reader reads the retrieved documents to formulate its answer.

OpenProBono offers various LLM Readers using models offered by OpenAI, HuggingFace, and more. We will create a simple LLM Reader connected to the OpenAI API for this notebook.

In [None]:
RAG_PROMPT_TEMPLATE = """Using the information contained in the context,
give a comprehensive answer to the user's question.
Respond only to the question asked, response should be concise and relevant to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
{context}
"""

RAG_PROMPT_UNFORMATTED = (
    "Using the information contained in the context, "
    "give a comprehensive answer to the question. "
    "Respond only to the question asked, response should be concise and relevant to the question. "
    "Provide the number of the source document when relevant. "
    "If the answer cannot be deduced from the context, do not give an answer."
)

In [None]:
from chat_models import chat


def answer_with_rag(
    question: str,
    chat_model: ChatModelParams,
    collection_name: str | None = None,
    k: int = 4,
) -> tuple[str, list[dict]]:
    """Answer a question using RAG."""
    # Gather documents with retriever
    relevant_docs = query(collection_name, question, k)
    relevant_docs = [doc["entity"]["text"] for doc in relevant_docs["result"]]  # keep only the text

    # Build the final prompt context
    context = "\n\n".join(f"Document {i}:::\n{doc.rstrip()}" for i, doc in enumerate(relevant_docs))
    # Format the args depending on engine
    if chat_model.engine == "hive":
        msg_history = [{"role":"user", "content":question}]
        answer, _ = chat(
            msg_history,
            chat_model,
            temperature=0,
            system=RAG_PROMPT_TEMPLATE.format(context=context),
        )
    else:
        msg_history = [{"role":"system", "content":RAG_PROMPT_TEMPLATE.format(context=context)}, {"role":"user", "content":question}]
        # Generate an answer
        response = chat(
            msg_history,
            chat_model,
            temperature=0,
        )
        answer = response.choices[0].message.content
    return answer, relevant_docs
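
To see what the reader model actually receives, the sketch below assembles the numbered context and the OpenAI-style message list without calling any API. The helper names `build_context` and `build_messages`, the shortened template, and the sample texts are assumptions for illustration; a blank line is placed between documents so that neighbouring chunks do not run together.

```python
# Shortened, hypothetical template; the notebook's RAG_PROMPT_TEMPLATE is longer.
TEMPLATE = "Answer the question using only the context.\n\nContext:\n{context}\n"


def build_context(docs: list[str]) -> str:
    """Number each retrieved document and separate documents with a blank line."""
    return "\n\n".join(f"Document {i}:::\n{doc.rstrip()}" for i, doc in enumerate(docs))


def build_messages(question: str, docs: list[str], template: str = TEMPLATE) -> list[dict]:
    """Put the context in the system message and the question in the user message."""
    return [
        {"role": "system", "content": template.format(context=build_context(docs))},
        {"role": "user", "content": question},
    ]


# Made-up retrieved chunks, only to show the resulting layout.
docs = ["Employers must pay wages on the regular payday.\n", "Overtime applies after 40 hours."]
messages = build_messages("When must wages be paid?", docs)
print(messages[0]["content"])
```

Numbering the documents is what lets a prompt ask the model to cite "the number of the source document", as `RAG_PROMPT_UNFORMATTED` does.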

### 3: Benchmarking the RAG System

>The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.
>
>To this end, **we set up a judge agent**. ⚖️🤖
>
>Out of the different RAG evaluation metrics, we choose to focus only on faithfulness since it is the best end-to-end metric of our system's performance.

>💡 *In the evaluation prompt, we give a detailed description of each metric on the 1-5 scale, as is done in Prometheus's prompt template: this helps the model ground its metric precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples.*
>
>💡 *Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement.*

In [None]:
def run_rag_tests(
    chat_model: ChatModelParams,
    eval_dataset: pd.DataFrame,
    output_file: str,
    collection_name: str | None = None,
    verbose: bool | None = True,
    test_settings: str | None = None,  # To document the test settings used
):
    """Run RAG tests on the given dataset and save the results to the given output file."""
    try:  # load previous generations if they exist
        with Path(output_file).open() as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for _, example in tqdm(eval_dataset.iterrows(), total=len(eval_dataset)):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, chat_model, collection_name)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": list(relevant_docs),
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with Path(output_file).open("w") as f:
            json.dump(outputs, f)

In [None]:
EVALUATION_SYSTEM_MSG = "You are a fair evaluator language model."

EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

In [None]:
eval_temperature = 0


def evaluate_answers(
    answer_path: str,
    evaluator_llm: ChatModelParams,
) -> None:
    """Evaluate generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if Path(answer_path).is_file():  # load previous generations if they exist
        with Path(answer_path).open() as f:
            answers = json.load(f)

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_llm.model}" in experiment:
            continue

        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_sys_msg = {"role": "system", "content": EVALUATION_SYSTEM_MSG}
        eval_msg = {"role": "user", "content": eval_prompt}
        eval_response = chat([eval_sys_msg, eval_msg], evaluator_llm, temperature=eval_temperature)
        eval_result = eval_response.choices[0].message.content
        feedback, score = (item.strip() for item in eval_result.split("[RESULT]"))
        experiment[f"eval_score_{evaluator_llm.model}"] = score
        experiment[f"eval_feedback_{evaluator_llm.model}"] = feedback

        with Path(answer_path).open(mode="w") as f:
            json.dump(answers, f)

>🚀 Let's run the tests and evaluate answers!👇

Read the dataset from the file:

In [None]:
couples_df = pd.read_json("data/NC-employment/employment_dataset.json")

In [None]:
from datetime import UTC, datetime

vdb_basename = "CourtEval_" + datetime.now(UTC).strftime("%Y%m%d")
idx = 1
gpt_3_5 = ChatModelParams(engine="openai", model=OpenAIModelEnum.gpt_3_5)
gpt_4o = ChatModelParams(engine="openai", model=OpenAIModelEnum.gpt_4_o)
hive_7b = ChatModelParams(engine="hive", model="hive-7b")
hive_70b = ChatModelParams(engine="hive", model="hive-70b")
evaluator_llm = gpt_3_5
reader_models = [hive_7b, hive_70b]

def evaluate_model(
    chat_model: ChatModelParams,
    settings_name: str,
    collection_name: str | None = None,
):
    # run evaluations with a configured knowledge base
    output_file_name = f"data/NC-employment/{settings_name}.json"

    print(f"Running evaluation for {settings_name}:")

    print("Running RAG...")
    run_rag_tests(
        chat_model=chat_model,
        eval_dataset=couples_df,
        output_file=output_file_name,
        collection_name=collection_name,
        verbose=True,
        test_settings=settings_name,
    )

    print("Running evaluation...")
    evaluate_answers(
        output_file_name,
        evaluator_llm,
    )

filenames = {
    gpt_3_5.model: "gpt_35",
    gpt_4o.model: "gpt_4o",
    hive_7b.model: "hive-7b",
    hive_70b.model: "hive-70b",
}


for encoder in [EncoderParams(name=OpenAIModelEnum.embed_small, dim=768)]:
    for chunk_hardmax, chunk_softmax, overlap in [(5000, 2000, 500)]:
        print("Loading knowledge base embeddings...")
        collection_name = "Courtroom5_NCStatutesPDF" # f"{vdb_basename}_{idx}"
        idx += 1
        if not pymilvus.utility.has_collection(collection_name):
            description = (
                f"Hardmax = {chunk_hardmax}, "
                f"Softmax = {chunk_softmax}, "
                f"Overlap = {overlap}, "
                f"Encoder = {encoder.name}."
            )
            create_collection(
                collection_name,
                encoder,
                description,
            )
            populate_database(
                collection_name,
                chunk_hardmax,
                chunk_softmax,
                overlap,
            )
        for llm in [gpt_3_5, gpt_4o]:
            settings_name = filenames[llm.model]
            evaluate_model(
                llm,
                settings_name,
                collection_name,
            )

Plot the average evaluation scores as an accuracy percentage:

In [None]:
import matplotlib.pyplot as plt

results_gpt35_indiv = pd.read_json("data/NC-individual/gpt_35.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_indiv = pd.read_json("data/NC-individual/gpt_4o.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_indiv = pd.read_json("data/NC-individual/hive-7b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_indiv = pd.read_json("data/NC-individual/hive-70b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_indiv_2 = pd.read_json("data/NC-individual/gpt_35_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_indiv_2 = pd.read_json("data/NC-individual/gpt_4o_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_indiv_2 = pd.read_json("data/NC-individual/hive-7b_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_indiv_2 = pd.read_json("data/NC-individual/hive-70b_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_indiv_codify = pd.read_json("data/NC-individual/gpt_35_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_indiv_codify = pd.read_json("data/NC-individual/gpt_4o_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_indiv_codify = pd.read_json("data/NC-individual/hive-7b_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_indiv_codify = pd.read_json("data/NC-individual/hive-70b_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_indiv_codify_2 = pd.read_json("data/NC-individual/gpt_35_codify_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_indiv_codify_2 = pd.read_json("data/NC-individual/gpt_4o_codify_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_indiv_codify_2 = pd.read_json("data/NC-individual/hive-7b_codify_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_indiv_codify_2 = pd.read_json("data/NC-individual/hive-70b_codify_2.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_court = pd.read_json("data/NC-court/gpt_35.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_court = pd.read_json("data/NC-court/gpt_4o.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_court = pd.read_json("data/NC-court/hive-7b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_court = pd.read_json("data/NC-court/hive-70b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_court_codify = pd.read_json("data/NC-court/gpt_35_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_court_codify = pd.read_json("data/NC-court/gpt_4o_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_court_codify = pd.read_json("data/NC-court/hive-7b_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_court_codify = pd.read_json("data/NC-court/hive-70b_codify.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

results_gpt35_employ = pd.read_json("data/NC-employment/gpt_35.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_gpt4o_employ = pd.read_json("data/NC-employment/gpt_4o.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive7b_employ = pd.read_json("data/NC-employment/hive-7b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]
results_hive70b_employ = pd.read_json("data/NC-employment/hive-70b.json", orient='records')["eval_score_gpt-3.5-turbo-0125"]

combined_gpt35 = pd.concat([results_gpt35_indiv, results_gpt35_indiv_2, results_gpt35_indiv_codify, results_gpt35_indiv_codify_2, results_gpt35_court, results_gpt35_court_codify, results_gpt35_employ])
combined_gpt4o = pd.concat([results_gpt4o_indiv, results_gpt4o_indiv_2, results_gpt4o_indiv_codify, results_gpt4o_indiv_codify_2, results_gpt4o_court, results_gpt4o_court_codify, results_gpt4o_employ])
combined_hive_7b = pd.concat([results_hive7b_indiv, results_hive7b_indiv_2, results_hive7b_indiv_codify, results_hive7b_indiv_codify_2, results_hive7b_court, results_hive7b_court_codify, results_hive7b_employ])
combined_hive_70b = pd.concat([results_hive70b_indiv, results_hive70b_indiv_2, results_hive70b_indiv_codify, results_hive70b_indiv_codify_2, results_hive70b_court, results_hive70b_court_codify, results_hive70b_employ])

# Define the x-axis labels and values (evaluation models)
models = ["gpt-3.5-turbo-0125", "gpt-4o", "hive-7b", "hive-70b"]

# Define the y-axis values (average scores)
y_values = [(results_gpt35_employ.mean() / 5) * 100, (results_gpt4o_employ.mean() / 5) * 100, (results_hive7b_employ.mean() / 5) * 100, (results_hive70b_employ.mean() / 5) * 100]

# Create the bar graph
plt.bar(models, y_values, color=["lightgreen", "green", "lightblue", "blue"])

# Add values above each bar
for i, value in enumerate(y_values):
    plt.text(i, value, f"{value:.2f}", ha="center")

# Set title and labels
plt.title("Evaluation Accuracy Comparison - GPT-3.5-turbo-0125 Judge")
plt.xlabel("Model")
plt.ylabel("Average Evaluation Accuracy (%)")

# Show the plot
plt.show()
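
The percentage on each bar is the mean judge score divided by the maximum score of 5. The arithmetic can be sketched on its own; the helper name `score_to_percent` and the sample scores are made up for illustration.

```python
def score_to_percent(scores: list[int], max_score: int = 5) -> float:
    """Convert a list of judge scores to a percentage of the maximum score."""
    if not scores:
        raise ValueError("scores must not be empty")
    mean = sum(scores) / len(scores)
    return mean / max_score * 100


print(f"{score_to_percent([5, 4, 3]):.2f}")  # mean 4.0 out of 5
print(f"{score_to_percent([1, 1]):.2f}")  # lowest possible scores
```

Because the rubric's lowest score is 1 rather than 0, this percentage cannot fall below 20, which is worth keeping in mind when comparing bars.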