# OpenProBono RAG Evaluation, Part 2
## Retrieval

This notebook turns a knowledge base into a vector database and evaluates retrieval over that database using question-answer pairs.

If you don't have any question-answer pairs, see Part 1 to create synthetic question-answer pairs from your knowledge base using LLMs.

### 0: Install and import dependencies

In [None]:
%pip install -q tqdm pandas pymilvus

In [None]:
import json
from datetime import UTC, datetime
from pathlib import Path

import pandas as pd
import pymilvus
from tqdm.auto import tqdm

from app.models import ChatModelParams, EncoderParams, OpenAIModelEnum, VoyageModelEnum

### 1: Build our Vector Database

We load data from our knowledge base into a Collection in our Milvus vector database.

To accomplish this, we must:

1. create a vector database
2. extract content from our knowledge base
3. chunk the extracted content
4. embed the chunks
5. insert embeddings into the database

Step 1 is fairly straightforward, and steps 2-5 are done inside a data-loading class called [KnowledgeBase](../knowledge_bases.py#L17).

To use the data loading class, we need to make a subclass of it and implement the [generate_elements()](../knowledge_bases.py#L21) function for step 2. This function extracts content from our sources and returns a tuple with a source name and its extracted content.

The entire process is executed in the [populate_database()](../knowledge_bases.py#L37) function.
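
To make the flow concrete, here is a minimal, self-contained sketch of the extract, chunk, embed, and insert steps. It does not use the real `KnowledgeBase` class: `populate`, `split_paragraphs`, `toy_embed`, and the in-memory `store` list are all hypothetical stand-ins, and the real `generate_elements()` and `populate_database()` may have different signatures.

```python
from typing import Callable, Iterator

Record = dict  # {"vector": ..., "text": ..., "metadata": ...}


def generate_elements(sources: dict[str, str]) -> Iterator[tuple[str, str]]:
    """Step 2: yield (source name, extracted content) pairs.

    Real extraction would parse PDFs or HTML; this toy version only strips whitespace.
    """
    for name, raw in sources.items():
        yield name, raw.strip()


def populate(
    sources: dict[str, str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
    store: list[Record],
) -> int:
    """Steps 2-5: extract, chunk, embed, and insert; returns the number of rows added."""
    added = 0
    for name, content in generate_elements(sources):
        chunks = chunk(content)  # step 3: chunk the extracted content
        vectors = embed(chunks)  # step 4: embed the chunks
        for text, vector in zip(chunks, vectors):  # step 5: insert
            store.append({"vector": vector, "text": text, "metadata": {"source": name}})
            added += 1
    return added


# Toy stand-ins (hypothetical): paragraph chunking and a two-number "embedding".
def split_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def toy_embed(texts: list[str]) -> list[list[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in texts]


store: list[Record] = []
sources = {"statute_a": "First paragraph.\n\nSecond paragraph here.", "statute_b": "Only one."}
print(populate(sources, split_paragraphs, toy_embed, store))  # 3
```

Passing the chunker and embedder in as functions keeps each step swappable, which is the same reason the real pipeline exposes chunking parameters and the encoder as settings.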

We have an example implementation using the NC General Statutes called [KnowledgeBaseNC](../knowledge_bases.py#L122). Let's break down each step and discuss the relevant code.

In [None]:
from app.knowledge_bases import KnowledgeBaseNC

eval_data = KnowledgeBaseNC()

#### 1.0: Create vector database

First things first, we should create an empty database to hold our content after we prepare it.

The content is converted into document embeddings and stored in a vector database for fast vector searching during RAG.

Below is a function that creates a vector database we can use to store documents.

The only required parameter is the `name` of the database, but we should also specify an embedding model using the [EncoderParams](../models.py#L220) parameter.

In [None]:
from app.milvusdb import create_collection

#### 1.1: Chunk sources into documents

>- In this part, **we split the documents from our knowledge base into smaller chunks**: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.
>- The goal is to build semantically relevant snippets: not too small to support an answer, and not too large, to avoid diluting individual ideas.
>
>Many options exist for text splitting:
>
>- split every n words / characters, but this risks cutting paragraphs or even sentences in half
>- split after n words / characters, but only on sentence boundaries
>- **recursive split** tries to preserve even more of the document structure by processing it in a tree-like way, splitting first on the largest units (chapters) and then recursively splitting on smaller units (paragraphs, sentences).
>
>To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
>
>[This space](https://huggingface.co/spaces/m-ric/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.

As mentioned previously, the [populate_database()](../knowledge_bases.py#L37) function contains the code for extracting, chunking, embedding, and uploading documents.

The objects returned from `generate_elements()` are passed into the chunking function, [chunk_by_title()](../knowledge_bases.py#L67).
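
As an illustration of the "split after n characters, but only on sentence boundaries" option, here is a minimal sketch in plain Python. It is not what `chunk_by_title()` does; the function name `chunk_on_sentences` and the naive regex-based sentence splitter are assumptions made for the example.

```python
import re


def chunk_on_sentences(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive sentence split: break on whitespace that follows ., !, or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Employers must pay wages. Wages are due on payday. Notice is required."
for chunk in chunk_on_sentences(text, max_chars=55):
    print(repr(chunk))
# 'Employers must pay wages. Wages are due on payday.'
# 'Notice is required.'
```

Legal text is full of abbreviations and section numbers that end in periods, so a production splitter would need something sturdier than this regex.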

#### 1.2: Embed documents

>The retriever acts like an internal search engine: given the user query, it returns the most relevant documents from your knowledge base.

An embedding model transforms documents into vectors, and Milvus creates an index over the vectors for fast and accurate retrieval.

We transform documents via the [embed_strs()](../encoders.py#L80) function, which accepts a list of strings and an `EncoderParams` object and returns a list of vectors.

This function gets called from [populate_database()](../knowledge_bases.py#L78) after chunking and before uploading.

#### 1.3: Insert documents

Now that we've chunked and embedded our documents, the last step is to insert the document embeddings and metadata into the newly created vector database.

An embedded document is represented by its corresponding `vector`, `text`, and `metadata`.

The [upload_data()](../milvusdb.py#L356) function accepts a list of dictionaries containing this data and a destination `collection_name`.

#### 1.4: Populate the vector database

So far, the parameters we can control are:

1. Chunking strategy parameters
    - chunk hardmax: the maximum number of characters in a chunk
    - chunk softmax: the preferred maximum number of characters in a chunk (see `new_after_n_chars` in `chunk_by_title` for more)
    - overlap: the number of characters to overlap between consecutive chunks
2. Embedding model
3. The number of documents to retrieve for a query (k)

There are also some parameters we can't currently control:

1. Document loader (unstructured)
2. Base chunking strategy (`chunk_by_title`)
3. Similarity metric (normalized inner product ~ cosine)
4. Vector index (Zilliz autoindex)
5. Relevancy threshold (currently not set, but could be anywhere between 0 and 2, assuming embeddings are normalized and cosine distance is used)
6. Reranking (distance is the default ranking)

Once we decide on values for the parameters we can control, we can chunk and embed our sources into a vector database. We can test the retrieval component of the system separately from generation.
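
The 0 to 2 range quoted for the relevancy threshold follows from the similarity metric: for unit-length vectors the inner product equals the cosine similarity, which lies in [-1, 1], so cosine distance (1 minus similarity) lies in [0, 2]. The sketch below checks the three extreme cases with plain Python lists. The helper names are hypothetical and this is not how Milvus computes distances internally.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, computed on unit-normalized copies of a and b."""
    return 1.0 - inner_product(normalize(a), normalize(b))


same = cosine_distance([3.0, 4.0], [6.0, 8.0])        # same direction: about 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 5.0])  # perpendicular: about 1
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])  # opposite direction: about 2
print(same, orthogonal, opposite)
```

A relevancy threshold would then simply drop any retrieved document whose distance exceeds the chosen cutoff.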

#### 1.5: Query the vector database

The last step to benchmark is to call a function to retrieve vectors based on semantic similarity to a query. This function accepts a `collection_name`, a `query` string, and the number `k` of results to return.

In [None]:
from app.milvusdb import query

#### 1.6: Benchmark retrieval

There are many ways to test retrieval from a vector database. We will use two simple metrics for now:

1. Mean Reciprocal Rank (MRR) - at what rank does the first relevant document show up in our list of retrieved documents? The reciprocal of that rank is averaged over all questions.
2. Precision @ K - out of the K documents we retrieve, what fraction are relevant?
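
Given one list of relevance labels per question, in retrieval order, both metrics reduce to a few lines. The sketch below uses made-up labels, and the function names are illustrative only.

```python
def reciprocal_rank(labels: list[bool]) -> float:
    """1 / rank of the first relevant document, or 0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1 / rank
    return 0.0


def precision_at_k(labels: list[bool]) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return sum(labels) / len(labels) if labels else 0.0


# One list of relevance labels per question, in retrieval order (made-up data).
per_question = [
    [False, True, True, False],    # first hit at rank 2, precision 0.5
    [True, False, False, False],   # first hit at rank 1, precision 0.25
    [False, False, False, False],  # no hit
]
mrr = sum(reciprocal_rank(q) for q in per_question) / len(per_question)
mean_precision = sum(precision_at_k(q) for q in per_question) / len(per_question)
print(mrr, mean_precision)  # 0.5 0.25
```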

These metrics rely on classifying a document as relevant or irrelevant, and we can use LLMs for this.

Let's write a prompt to classify a bit of context as relevant or irrelevant towards answering a given question.

In [None]:
CONTEXT_RELEVANCE_PROMPT = """You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Reference text]: {context}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text contains information that can answer the Question.
Please focus on whether the very specific question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated", and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question."""

Now let's write the function to call the LLM to perform this classification task. 

In [None]:
from app.chat_models import chat

eval_temperature = 0

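def classify_context(chat_model: ChatModelParams, question: str, context: str) -> str:
    """Ask the LLM whether `context` is relevant to `question`; returns the raw reply ("relevant" or "unrelated" expected)."""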
    if chat_model.engine == "hive":
        msg_history = [{"role":"user", "content":question}]
        answer, _ = chat(
            msg_history,
            chat_model,
            temperature=eval_temperature,
            system=CONTEXT_RELEVANCE_PROMPT.format(
                question=question,
                context=context,
            ),
        )
    elif chat_model.engine == "anthropic":
        msg_history = [{"role":"user", "content":question}]
        response = chat(
            msg_history,
            chat_model,
            temperature=eval_temperature,
            system=CONTEXT_RELEVANCE_PROMPT.format(
                question=question,
                context=context,
            ),
        )
        answer = response.content[-1].text
    else:
        msg_history = [
            {
                "role": "system",
                "content": CONTEXT_RELEVANCE_PROMPT.format(
                    question=question,
                    context=context,
                ),
            },
            {
                "role": "user",
                "content": question,
            },
        ]
        response = chat(
            msg_history,
            chat_model,
            temperature=eval_temperature,
        )
        answer = response.choices[0].message.content
    return answer

Next we need a function to perform the queries on our vector database using the questions in the evaluation dataset.

In [None]:
def run_retrieval(
    collection_name: str,
    eval_dataset: pd.DataFrame,
    k: int,
    output_file: str,
    synth_data: bool = True,
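) -> None:
    """Query the collection for each question and checkpoint the results to output_file."""
    try:  # load previous generations if they exist
        with Path(output_file).open() as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        outputs = []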

    for _, example in tqdm(eval_dataset.iterrows(), total=len(eval_dataset)):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        # Gather documents with retriever
        relevant_docs = query(collection_name, question, k)
        if not relevant_docs["result"]:
            print("ERROR: no results found")
            continue
        # keep only text and distance
        relevant_docs = [
            {
                "text": doc["entity"]["text"],
                "distance": doc["distance"],
            }
            for doc in relevant_docs["result"]
        ]
        result = {
            "question": question,
            "retrieved_docs": relevant_docs,
        }
        if synth_data:
            result["source"] = example["context"]
        outputs.append(result)

        with Path(output_file).open("w") as f:
            json.dump(outputs, f)

Once we run the retrieval, we perform our evaluation by classifying each retrieved document for each question.

If synthetic data was provided that has the source text the question/answer pair was generated from, we can further evaluate our retrieved documents.

We can compute the longest common substring between the source text and each document and divide this number by the length of the document.

This measures how much of the retrieved document is in the source text.
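
As a small worked example of this ratio, the sketch below uses the standard library's `difflib.SequenceMatcher` to find the longest common substring. This is an alternative way to compute the same quantity, shown only for illustration; the name `lcs_fraction` and the sample strings are made up.

```python
from difflib import SequenceMatcher


def lcs_fraction(source: str, chunk: str) -> float:
    """Length of the longest common substring divided by the chunk length."""
    if not chunk:
        return 0.0
    # autojunk=False disables the heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, source, chunk, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(chunk))
    return match.size / len(chunk)


source = "An employer shall pay every employee all wages due on the regular payday."
print(lcs_fraction(source, "all wages due"))   # 1.0: the whole chunk appears in the source
print(lcs_fraction(source, "all wages owed"))  # 10 of 14 characters match, about 0.71
```

A value near 1 means the retrieved chunk sits almost entirely inside the source passage. A low value means the chunk overlaps the source only in a short stretch of text, if at all.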

In [None]:
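def longest_common_substring(s1: str, s2: str) -> int:
    """Return the length of the longest common substring of s1 and s2 (O(len(s1) * len(s2)) dynamic programming)."""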
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest = 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
    return longest


def evaluate_retrieval(chat_model: ChatModelParams, retrieval_path: str, synth_data: bool = True):
    """Evaluate retrieval. Modifies the given retrieval file in place for better checkpointing."""
    retrievals = []
    if Path(retrieval_path).is_file():  # load retrieval
        with Path(retrieval_path).open() as f:
            retrievals = json.load(f)

    for experiment in tqdm(retrievals):
        contexts = experiment["retrieved_docs"]
        first_relevant, num_relevant = 0, 0
        for i, context in enumerate(contexts):
            eval_key = f"eval_{chat_model.model}"
            if eval_key not in context:
                # get the relevant/irrelevant classification from the LLM
                answer = classify_context(chat_model, experiment["question"], context["text"])
                context[eval_key] = answer
            # count relevance using the label from the evaluator that was passed in,
            # rather than a hard-coded model name
            if context[eval_key] == "relevant":
                num_relevant += 1
                if first_relevant == 0:
                    first_relevant = i + 1

            if synth_data and "source_lcspct" not in context:
                # get longest common substring / length of retrieved context
                lcs = longest_common_substring(experiment["source"], context["text"])
                context["source_lcspct"] = lcs / len(context["text"])

            with Path(retrieval_path).open(mode="w") as f:
                json.dump(retrievals, f)

        # per-question reciprocal rank and precision @ k; their means over all questions give MRR and mean precision
        experiment["rr"] = 1 / first_relevant if first_relevant > 0 else 0
        experiment["precision"] = num_relevant / len(contexts)

    # write the last row results to file
    with Path(retrieval_path).open(mode="w") as f:
        json.dump(retrievals, f)


Load your `DataFrame` here:

In [None]:
data_dir = "data/NC-employment"
couples_df = pd.read_json(f"{data_dir}/employment_dataset.json")

And finally, run the evaluation!

In [None]:
evaluator_llm = ChatModelParams(engine="openai", model="gpt-4o")

def evaluate_embedding(
    eval_dataset: pd.DataFrame,
    collection_name: str,
    chunk_hardmax: int,
    chunk_softmax: int,
    overlap: int,
    k: int,
    encoder: EncoderParams,
    synth_data: bool = True,
):
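    # run evaluations with a configured knowledge base; settings_name doubles as the
    # output file name, so the cells that read results back must use the same format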
    settings_name = (
        f"collection_name:{collection_name}"
        f"hardmax:{chunk_hardmax}"
        f"softmax:{chunk_softmax}"
        f"_overlap:{overlap}"
        f"_k:{k}_encoder:{encoder.name}-{encoder.dim}"
    )
    output_file_name = f"{data_dir}/{settings_name}.json"

    print(f"Running evaluation for {settings_name}:")

    print("Running retrieval...")
    run_retrieval(
        collection_name,
        eval_dataset,
        k,
        output_file_name,
        synth_data,
    )

    print("Running evaluation...")
    evaluate_retrieval(evaluator_llm, output_file_name, synth_data)

vdb_basename = "Eval_" + datetime.now(UTC).strftime("%Y%m%d")
idx = 1
for encoder in [EncoderParams(name=VoyageModelEnum.law, dim=1024)]:
    for chunk_hardmax, chunk_softmax, overlap in [(5000, 2000, 500)]:
        print("Loading knowledge base embeddings...")
        collection_name = "Voyage_Courtroom5_NCStatutesPDF" # f"{vdb_basename}_{idx}"
        idx += 1
        if not pymilvus.utility.has_collection(collection_name):
            description = (
                f"Hardmax = {chunk_hardmax}, "
                f"Softmax = {chunk_softmax}, "
                f"Overlap = {overlap}, "
                f"Encoder = {encoder.name}."
            )
            create_collection(
                collection_name,
                encoder,
                description,
            )
            eval_data.populate_database(
                collection_name,
                chunk_hardmax,
                chunk_softmax,
                overlap,
            )
        for k in [2, 4, 6, 8]:
            evaluate_embedding(
                couples_df,
                collection_name,
                chunk_hardmax,
                chunk_softmax,
                overlap,
                k,
                encoder,
            )

Read the results back in from the output files:

In [None]:
results_k2 = pd.read_json(data_dir + "/collection_name:Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:2_encoder:text-embedding-3-small-768.json", orient='records')
results_k4 = pd.read_json(data_dir + "/collection_name:Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:4_encoder:text-embedding-3-small-768.json", orient='records')
results_k6 = pd.read_json(data_dir + "/collection_name:Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:6_encoder:text-embedding-3-small-768.json", orient='records')
results_k8 = pd.read_json(data_dir + "/collection_name:Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:8_encoder:text-embedding-3-small-768.json", orient='records')
results_voyage_k2 = pd.read_json(data_dir + "/collection_name:Voyage_Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:2_encoder:voyage-law-2-1024.json", orient='records')
results_voyage_k4 = pd.read_json(data_dir + "/collection_name:Voyage_Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:4_encoder:voyage-law-2-1024.json", orient='records')
results_voyage_k6 = pd.read_json(data_dir + "/collection_name:Voyage_Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:6_encoder:voyage-law-2-1024.json", orient='records')
results_voyage_k8 = pd.read_json(data_dir + "/collection_name:Voyage_Courtroom5_NCStatutesPDFhardmax:5000softmax:2000_overlap:500_k:8_encoder:voyage-law-2-1024.json", orient='records')

Visualize MRR

In [None]:
import matplotlib.pyplot as plt

# Define the x-axis labels: "o" = OpenAI text-embedding-3-small, "v" = voyage-law-2,
# followed by the number of retrieved documents (k)
encoders = ["o2", "v2", "o4", "v4", "o6", "v6", "o8", "v8"]
# Define the y-axis values (average scores)
y_values = [
    results_k2["rr"].mean(),
    results_voyage_k2["rr"].mean(),
    results_k4["rr"].mean(),
    results_voyage_k4["rr"].mean(),
    results_k6["rr"].mean(),
    results_voyage_k6["rr"].mean(),
    results_k8["rr"].mean(),
    results_voyage_k8["rr"].mean(),
]

# Create the bar graph
plt.bar(encoders, y_values)

# Add values above each bar
for i, value in enumerate(y_values):
    plt.text(i, value, f"{value:.2f}", ha="center")

# Set title and labels
plt.title("Retrieval Evaluation - Mean Reciprocal Rank (MRR)")
plt.xlabel("Embedding Model")
plt.ylabel("MRR")

# Show the plot
plt.show()

Visualize Precision @ K

In [None]:
# Define the y-axis values (average scores)
y_values = [
    results_k2["precision"].mean(),
    results_voyage_k2["precision"].mean(),
    results_k4["precision"].mean(),
    results_voyage_k4["precision"].mean(),
    results_k6["precision"].mean(),
    results_voyage_k6["precision"].mean(),
    results_k8["precision"].mean(),
    results_voyage_k8["precision"].mean(),
]

# Create the bar graph
plt.bar(encoders, y_values)

# Add values above each bar
for i, value in enumerate(y_values):
    plt.text(i, value, f"{value:.2f}", ha="center")

# Set title and labels
plt.title("Retrieval Evaluation - Mean Precision")
plt.xlabel("Embedding Model")
plt.ylabel("Mean Precision")

# Show the plot
plt.show()

Visualize the average percentage of a chunk's text that is a substring of the source text that generated the question/answer pair. This is only available if synthetic data or Q/A source data was provided.

In [None]:
# Flatten the per-question lists of retrieved docs into one list of LCS ratios per run
lcs_k2 = [doc["source_lcspct"] for docs in results_k2["retrieved_docs"] for doc in docs]
lcs_k4 = [doc["source_lcspct"] for docs in results_k4["retrieved_docs"] for doc in docs]
lcs_k6 = [doc["source_lcspct"] for docs in results_k6["retrieved_docs"] for doc in docs]
lcs_k8 = [doc["source_lcspct"] for docs in results_k8["retrieved_docs"] for doc in docs]
lcs_voyage_k2 = [doc["source_lcspct"] for docs in results_voyage_k2["retrieved_docs"] for doc in docs]
lcs_voyage_k4 = [doc["source_lcspct"] for docs in results_voyage_k4["retrieved_docs"] for doc in docs]
lcs_voyage_k6 = [doc["source_lcspct"] for docs in results_voyage_k6["retrieved_docs"] for doc in docs]
lcs_voyage_k8 = [doc["source_lcspct"] for docs in results_voyage_k8["retrieved_docs"] for doc in docs]

# Define the y-axis values
y_values = [
    (sum(lcs_k2) / len(lcs_k2)) * 100,
    (sum(lcs_voyage_k2) / len(lcs_voyage_k2)) * 100,
    (sum(lcs_k4) / len(lcs_k4)) * 100,
    (sum(lcs_voyage_k4) / len(lcs_voyage_k4)) * 100,
    (sum(lcs_k6) / len(lcs_k6)) * 100,
    (sum(lcs_voyage_k6) / len(lcs_voyage_k6)) * 100,
    (sum(lcs_k8) / len(lcs_k8)) * 100,
    (sum(lcs_voyage_k8) / len(lcs_voyage_k8)) * 100,
]

# Create the bar graph
plt.bar(encoders, y_values)

# Add values above each bar
for i, value in enumerate(y_values):
    plt.text(i, value, f"{value:.2f}", ha="center")

# Set title and labels
plt.title("Retrieval Evaluation - Longest Common Substring %")
plt.xlabel("Embedding Model")
plt.ylabel("Average % Substring")

# Show the plot
plt.show()