# My Evaluation Approach


![](assets/my_approach.png)


1. **Document Preparation**: Load the entire document and extract content.
2. **Question Generation**: Use an LLM to generate broad and detailed questions about the document.
3. **Chunking Application**: Apply various chunking methods to the document. The next steps will be executed for each chunking strategy separately.
4. **Chunk-Question Relevancy Scoring**: For each generated question, instruct an LLM to grade all chunks by relevancy to the question. Chunks with a non-zero score become the ground truth and form our silver evaluation dataset.
5. **Human Annotation**: Human annotators then review and modify the silver dataset to produce the gold dataset.
6. **Chunk Retrieval**: Embed the chunks and retrieve the most similar chunks to each question.
7. **Evaluate Retrieval**: Compare the retrieved chunks with the synthesized ground truth chunks and calculate Precision, Recall and nDCG to analyse the retrieval performance.
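
To make step 7 concrete, here is a minimal sketch of how Precision, Recall and nDCG could be computed for a single question. It is not the `calculate_metrics` function from `utils.metrics` used later in this notebook; the helper name `retrieval_metrics` and the example chunk ids are hypothetical, and retrieved ids are assumed to be unique.

```python
import math
from typing import Dict, List


def retrieval_metrics(retrieved_ids: List[str], ground_truth: Dict[str, float]) -> Dict[str, float]:
    """Precision, recall and nDCG for one question (illustrative helper).

    ground_truth maps chunk id -> graded relevancy (> 0); unlisted ids count as 0.
    """
    hits = [cid for cid in retrieved_ids if cid in ground_truth]
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0

    # DCG: graded relevancy discounted by log2(rank + 1), with ranks starting at 1
    dcg = sum(
        ground_truth.get(cid, 0.0) / math.log2(rank + 2)
        for rank, cid in enumerate(retrieved_ids)
    )
    # Ideal DCG: best possible ordering, truncated to the retrieved list length
    ideal = sorted(ground_truth.values(), reverse=True)[: len(retrieved_ids)]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    ndcg = dcg / idcg if idcg else 0.0
    return {"precision": precision, "recall": recall, "ndcg": ndcg}


# Hypothetical example: two of three retrieved chunks are relevant
example = retrieval_metrics(["c1", "c7", "c3"], {"c1": 4, "c3": 2, "c9": 1})
perfect = retrieval_metrics(["c1", "c3", "c9"], {"c1": 4, "c3": 2, "c9": 1})
```

Because the relevancy grades are graded (0-4) rather than binary, nDCG rewards rankings that place the most relevant chunks first, while Precision and Recall only count hits.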

## Setup


In [1]:
%load_ext autoreload
%autoreload 2


In [2]:
import json
import time
import os
from typing import List, Dict, TypedDict
from pathlib import Path
from tqdm import tqdm
import pandas as pd
import nest_asyncio
from dotenv import load_dotenv

import openai
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import VectorStore
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.runnables import RunnablePassthrough
from langchain.schema import StrOutputParser

In [3]:
loaded = load_dotenv(override=True)

data_dir = "data/"
os.environ['CHUNKING_BENCHMARK_DATADIR'] = data_dir

# 1. Load and Save Documents


Each file is loaded as a single LangChain document, which may be too large to fit into an LLM's context window. Therefore, we need to split these documents into smaller pieces of text for further processing.

In [None]:
from utils.data_loader import save_documents

documents: List[Document] = []
for file in os.listdir(data_dir+"documents"):
    file_path = os.path.join(data_dir+"documents", file)
    loader = TextLoader(file_path)
    documents.extend(loader.load())

save_documents(documents, data_dir)

In [4]:
from utils.data_loader import load_documents
documents = load_documents(data_dir)

# 2. Question Generation


Generate synthetic questions for each document to challenge chunking strategies on multi-context queries.


In [24]:
# Question generation based on documents
class Question(BaseModel):
    question: str = Field(description="The question generated by the model")
    type: str = Field(description="The type of question generated")

class Questions(BaseModel):
    questions: List[Question] = Field(description="The list of questions generated by the model")

parser = JsonOutputParser(pydantic_object=Questions)

question_generation_prompt = PromptTemplate(
    input_variables=["document"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
    template="""
You are a highly knowledgeable assistant tasked with generating challenging questions to evaluate different document chunking strategies in retrieval augmented generation (RAG) pipelines. 

<document>
{document}
</document>

Based on the document provided, generate a set of questions that meet the following criteria:
1. **Specific and Detailed Question (type: detailed):**  
   - This question should be highly specific and precise, targeting one particular section, name, date, event, or factual detail within the document.
   - The question should be designed to challenge chunking strategies that include too much context, potentially leading to noise or irrelevant information being retrieved. The answer should require precise retrieval of a specific detail without confusion from surrounding content.
2. **Broad and Complex Question (type: broad):**  
   - This question should be complex, requiring a comprehensive understanding of the document as a whole, involving multiple interlinked facts or concepts.
   - The question should challenge chunking strategies that use chunks that are too small, making it difficult to retrieve and synthesize a well-rounded answer from fragmented pieces of information. The answer should require a broad perspective and the ability to connect multiple sections of the document.

For each question type please generate 2 such questions about the provided document. Use the following format:
{format_instructions}
"""
)

generator_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question_generation_chain = question_generation_prompt | generator_llm | parser
questions = []
for document in documents[:6]:
    doc_questions = question_generation_chain.invoke({"document": document.page_content})["questions"]
    # Remember which document each question was generated from
    for question in doc_questions:
        question["source"] = document.metadata["source"]

    questions.extend(doc_questions)

In [25]:
with open(f"{data_dir}synthetic_questions.json", "w") as f:
    json.dump(questions, f)

# 3. Apply chunking


Execute another notebook that applies various chunking strategies to the documents. You can also run that notebook separately to have manual control over each method.

In [None]:
%run -i chunking_strategies.ipynb
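
The chunking notebook itself is not shown here. As an illustration of the simplest strategy family, the following sketch splits a text into fixed-size windows with an optional overlap. It assumes that experiment names such as `fixed_size-2048-200` encode chunk size and overlap (the evaluation code later parses them this way) and that sizes are measured in characters; the helper `fixed_size_chunks` is hypothetical and not the splitter used in `chunking_strategies.ipynb`.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into character windows of chunk_size.

    Each window shares `overlap` characters with the previous one.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]


# Toy input so the windows are easy to inspect
no_overlap = fixed_size_chunks("abcdefghij", chunk_size=4)
with_overlap = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2)
```

Overlap increases the chunk count for the same document, which is consistent with the higher counts of the `-200` variants in the table further down.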

Load split chunks from disk:

In [5]:
from utils.data_loader import load_chunks
split_chunks: Dict[str, List[Document]] = load_chunks(data_dir)

Analyze chunk count and average chunk size per strategy:

In [61]:
df = pd.DataFrame(columns=["Experiment", "Chunk Count", "Average Chunk Size"])
for experiment_name, chunks in split_chunks.items():
    df.loc[len(df)] = [experiment_name, len(chunks), round(sum([len(chunk.page_content) for chunk in chunks])/len(chunks))]

df.sort_values(by="Chunk Count", ascending=True).style.hide(axis="index")

Experiment,Chunk Count,Average Chunk Size
fixed_size-2048-0,65,1915
semantic_chunks_95,67,1848
fixed_size-2048-200,72,1902
recursive-2048-0,72,1728
recursive-2048-200,73,1736
semantic_chunks_90,118,1049
fixed_size-1024-0,127,980
markdown_header_parent,146,873
markdown_header,146,844
fixed_size-1024-200,153,1000


# 4. (5.) Create Evaluation Datasets

### Generate Relevancy Score for each chunk


In [6]:
with open(f"{data_dir}synthetic_questions.json", "r") as f:
    questions = json.load(f)

Initialize datasets with questions

In [33]:
from utils.metrics import Testset

datasets: Dict[str, List[Testset]]  = {}
for experiment_name in split_chunks.keys():
    datasets[experiment_name] = []
    for question in questions:
        datasets[experiment_name].append({
            "question": question["question"],
            "source": question["source"],
            "type": question["type"],
            "ground_truth_chunks": {} 
        })

The relevancy prompt is adapted from TruLens. The difference is that I apply it to all chunks, whereas TruLens only computes it on the retrieved chunks.


In [None]:
from gen_ai_hub.proxy.langchain import init_llm

from concurrent.futures import ThreadPoolExecutor, as_completed

class Grading(BaseModel):
    score: float = Field(description="The score provided by the grader")

parser = JsonOutputParser(pydantic_object=Grading)

judge_chunk_relevancy_prompt = PromptTemplate(
    input_variables=["question", "context", "document"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
    template="""
You are a RELEVANCE grader. Given a CONTEXT from a DOCUMENT and a QUESTION, provide a score measuring the relevance between the CONTEXT and the QUESTION.
The DOCUMENT acts as the ground truth and contains all necessary information to answer the QUESTION fully.
The **RELEVANCE score** evaluates how well the CONTEXT helps answer the specific QUESTION in comparison to other parts of the DOCUMENT.

Scoring Guidelines (0-4):
- **0**: The CONTEXT does not answer the QUESTION at all.
- **1**: The CONTEXT provides very minimal relevant information that doesn't contribute meaningfully to answering the QUESTION.
- **2**: The CONTEXT answers part of the QUESTION but misses key details or is lacking in clarity.
- **3**: The CONTEXT provides a mostly complete answer, but may lack a small detail or nuanced information that would make it fully comprehensive.
- **4**: The CONTEXT contains the direct and complete information needed to fully and thoroughly answer the QUESTION.

<DOCUMENT>
{document}
</DOCUMENT>

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>

Respond in the following format:
{format_instructions}
"""
)

judge_chunk_noise_prompt = PromptTemplate(
    input_variables=["question", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
    template="""
You are a NOISE grader. Given a CONTEXT and a QUESTION, provide a noise score reflecting the amount of **non-contributing** or **excessive** information in the CONTEXT that distracts from answering the QUESTION.

# Scoring Guidelines (0-4):
The CONTEXT contains...
- **0**: no irrelevant or excessive information.
- **1**: some (1 - 4 sentences) irrelevant or excessive information, but most content is useful.
- **2**: a moderate amount (5 - 10 sentences) of irrelevant or excessive content.
- **3**: a significant amount (more than 10 sentences) of irrelevant or excessive content, though some useful parts remain.
- **4**: mostly (multiple paragraphs) irrelevant or excessive content, making it difficult to find the relevant information.

# CONTEXT and QUESTION to be graded:
<CONTEXT>
{context}
</CONTEXT>
<QUESTION>
{question}
</QUESTION>

Output the noise score in the following format:
{format_instructions}
"""
)

critic_llm = init_llm(model_name="gpt-4o", temperature=0)
# critic_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
relevancy_chain = judge_chunk_relevancy_prompt | critic_llm | parser
noise_chain = judge_chunk_noise_prompt | critic_llm | parser

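def make_request_with_backoff(question, context, document, retries=8):
    """Grade one chunk; retry with exponential backoff when rate limited."""
    for i in range(retries):
        try:
            relevancy_score = relevancy_chain.invoke({"question": question, "context": context, "document": document})["score"]
            # Noise is only graded for relevant chunks. The default of 5 lies one above
            # the prompt's 0-4 scale; chunks with zero relevancy are discarded later anyway.
            noise_score = 5
            if relevancy_score != 0:
                noise_score = noise_chain.invoke({"question": question, "context": context})["score"]
            return relevancy_score, noise_score
        except openai.RateLimitError as e:
            if i == retries - 1:
                raise e
            wait_time = 2**i
            print(f"Rate limited, waiting {wait_time} seconds")
            time.sleep(wait_time)
        except openai.APIError as e:
            # Other API errors are retried immediately; re-raise on the last attempt
            # so the caller never receives an implicit None.
            if i == retries - 1:
                raise
            print(e)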


def process_chunk(chunk, testset, document_content):
    if chunk.metadata["source"] not in testset["source"]:
        return None, None, None

    chunk_relevancy, chunk_noise = make_request_with_backoff(testset["question"], context=chunk.page_content, document=document_content)
    if chunk_relevancy != 0.0:
        return str(chunk.metadata["id"]), chunk_relevancy, chunk_noise
    return None, None, None

for experiment_name, questions in datasets.items():
    if os.path.exists(f"{data_dir}/datasets/{experiment_name}.json") or experiment_name != "recursive-512-0":
        continue

    print("Collecting ground truth for", experiment_name)
    for testset in tqdm(questions):
        document_content = ""
        for document in documents:
            if document.metadata["source"] == testset["source"]:
                document_content = document.page_content
                break
        ground_truth = {}
        with ThreadPoolExecutor(max_workers=2) as executor:
            future_to_chunk = {
                executor.submit(process_chunk, chunk, testset, document_content): chunk
                for chunk in split_chunks[experiment_name]
            }
            for future in as_completed(future_to_chunk):
                chunk_id, relevancy, noise = future.result()
                if chunk_id and relevancy:
                    ground_truth[chunk_id] = (relevancy, noise)
        
        if len(ground_truth):
            testset["ground_truth_chunks"] = ground_truth

    with open(f"{data_dir}/datasets/{experiment_name}.json", "w") as f:
        json.dump(questions, f)

### Analyse dataset

In [41]:
from utils.data_loader import load_datasets
datasets = load_datasets(data_dir)

df = pd.DataFrame(columns=[ "Experiment", "Chunk Count", "Average Chunk Size", "Average Ground Truth Count","Average Relevancy per Ground Truth 0-4", "Average Noise per Ground Truth 0-5"])
for experiment_name, questions in datasets.items():
    total_relevancy = 0
    total_noise = 0
    total_ground_truth_count = 0
    question_count = 0
    for question in questions:
        # if question["type"] != "detailed":
        #     continue
        for chunk_relevancy, chunk_noise in question["ground_truth_chunks"].values():
            total_relevancy += chunk_relevancy
            total_noise += chunk_noise
        total_ground_truth_count += len(question["ground_truth_chunks"])
        question_count += 1
    
    average_ground_truth_count = total_ground_truth_count / question_count
    average_relevancy_per_ground_truth = total_relevancy / total_ground_truth_count
    average_noise_per_ground_truth = total_noise / total_ground_truth_count
    average_chunk_size = round(sum([len(chunk.page_content) for chunk in split_chunks[experiment_name]]) / len(split_chunks[experiment_name]))
    df.loc[len(df)] = [experiment_name, len(split_chunks[experiment_name]), average_chunk_size, average_ground_truth_count, average_relevancy_per_ground_truth, average_noise_per_ground_truth ]

df.sort_values(by="Average Chunk Size", ascending=False).style.hide(axis="index")

Experiment,Chunk Count,Average Chunk Size,Average Ground Truth Count,Average Relevancy per Ground Truth 0-4,Average Noise per Ground Truth 0-5
markdown_header_parent,146,873,2.625,2.492063,2.666667
recursive-512-0,353,351,5.916667,1.71831,3.140845


Average Total Ground Truth Token Size

In [8]:
mean_ground_truth_token_count = 0
for experiment_name, questions in datasets.items():
    ground_character_token_count = 0
    for question in questions:
        for split_chunk in split_chunks[experiment_name]:
            if str(split_chunk.metadata["id"]) in question["ground_truth_chunks"]:
                ground_character_token_count += len(split_chunk.page_content)
    mean_ground_truth_token_count += ground_character_token_count / (len(questions)*4) # divide by 4 because one token is approximately 4 characters

round(mean_ground_truth_token_count / len(datasets))

3343
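
The token figure above rests on the rule of thumb that one token corresponds to roughly four characters. A minimal sketch of that conversion in both directions, with the hypothetical helper `estimate_tokens` and the ratio as an adjustable assumption:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate from character length (heuristic, not a tokenizer)."""
    return round(len(text) / chars_per_token)


# A chunk of 2048 characters is estimated at about 512 tokens
tokens_in_chunk = estimate_tokens("x" * 2048)

# Conversely, a budget of 3340 tokens corresponds to this many characters
char_budget = 3340 * 4
```

For exact counts a real tokenizer for the target model would be needed; the heuristic only needs to be consistent across chunking strategies to compare them fairly.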

# 6. - 7. Evaluation


## Ingest Chunks into Vector Store

Using FAISS


In [103]:
# from langchain_huggingface import HuggingFaceEmbeddings

vector_stores: Dict[str, VectorStore] = {}

# embeddings = HuggingFaceEmbeddings(
#     model_name="Snowflake/snowflake-arctic-embed-l",
    # model_name="Alibaba-NLP/gte-large-en-v1.5",
#     model_kwargs={"device": 0, 'trust_remote_code': True},  # Comment out to use CPU
# )
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

model_name = (embeddings.model_name if hasattr(embeddings, 'model_name') else embeddings.model).replace("/", "_")
vector_store_dir = f"{data_dir}vector_stores/{model_name}"
Path(vector_store_dir).mkdir(parents=True, exist_ok=True)
for experiment_name, chunks in split_chunks.items():
    if os.path.exists(f"{vector_store_dir}/{experiment_name}"):
        print("Loading", experiment_name)
        vector_stores[experiment_name] = FAISS.load_local(f"{vector_store_dir}/{experiment_name}", embeddings, allow_dangerous_deserialization=True)
    else:
        print("Indexing", experiment_name)
        vector_stores[experiment_name] = FAISS.from_documents(chunks, embeddings)
        vector_stores[experiment_name].save_local(f"{vector_store_dir}/{experiment_name}")

Loading recursive-512-0
Loading markdown_header_parent
Loading fixed_size-512-200
Loading recursive-512-200
Loading markdown_header
Loading recursive-1024-200
Loading recursive-1024-0
Loading fixed_size-2048-200
Loading fixed_size-2048-0
Loading semantic_chunks_90
Loading fixed_size-512-0
Loading recursive-2048-200
Loading fixed_size-1024-0
Loading fixed_size-1024-200
Loading semantic_chunks_95
Loading recursive-2048-0


## Evaluate Retrieval

Select the evaluation approach and load existing results if they are already on disk.

In [121]:
class EvalApproach:
    FIXED_K = "Fixed-K"
    GROUND_TRUTH_K = "Ground-Truth-K"
    TOKEN_LIMIT = "Token-Limit"
    RATIO_K = "Ratio-K"

SEL_APPROACH: EvalApproach = EvalApproach.RATIO_K
FIXED_K = 20
TOKEN_LIMIT = 3340
RATIO_K = 0.1

model_name = "text-embedding-3-small"
eval_name = f"{SEL_APPROACH}-{FIXED_K}-{model_name}" if SEL_APPROACH == EvalApproach.FIXED_K else ""
eval_name = f"{SEL_APPROACH}-{model_name}" if SEL_APPROACH == EvalApproach.GROUND_TRUTH_K else eval_name
eval_name = f"{SEL_APPROACH}-{TOKEN_LIMIT}-{model_name}" if SEL_APPROACH == EvalApproach.TOKEN_LIMIT else eval_name
eval_name = f"{SEL_APPROACH}-{RATIO_K}-{model_name}" if SEL_APPROACH == EvalApproach.RATIO_K else eval_name
if os.path.exists(f"{data_dir}results/{eval_name}.csv"):
    results = pd.read_csv(f"{data_dir}results/{eval_name}.csv")
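
The four approaches differ only in how many chunks K are retrieved per question. The sketch below condenses that choice into one hypothetical helper, `select_k`, mirroring how the evaluation loop in this notebook derives K; the default values are the constants defined in the cell above, and 200 is the large K the Token-Limit approach retrieves before capping.

```python
def select_k(
    approach: str,
    n_chunks: int,
    n_ground_truth: int,
    fixed_k: int = 20,
    ratio_k: float = 0.1,
    token_limit_k: int = 200,
) -> int:
    """Number of chunks to retrieve for one question under each approach."""
    if approach == "Fixed-K":
        return fixed_k  # same K for every strategy
    if approach == "Ground-Truth-K":
        return n_ground_truth  # as many chunks as are relevant for this question
    if approach == "Ratio-K":
        return round(n_chunks * ratio_k)  # fixed share of the strategy's chunks
    if approach == "Token-Limit":
        return token_limit_k  # over-retrieve, then cap by context length
    raise ValueError(f"unknown approach: {approach}")


# 72 chunks at a ratio of 0.1 gives K = 7; 146 chunks gives K = 15
k_small = select_k("Ratio-K", n_chunks=72, n_ground_truth=3)
k_large = select_k("Ratio-K", n_chunks=146, n_ground_truth=3)
```

Ratio-K and Token-Limit aim to give strategies with many small chunks and strategies with few large chunks a comparable amount of retrieved text, whereas Fixed-K favours larger chunks.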

To generate new results run the following. It calculates the retrieval metrics for the loaded dataset.

In [105]:
from utils.data_loader import load_datasets
datasets = load_datasets(data_dir)

In [122]:
from utils.metrics import calculate_metrics, calculate_mean_metrics

results_list = []
for experiment_name, questions in datasets.items():
    if experiment_name not in vector_stores:
        continue

    K = FIXED_K if SEL_APPROACH == EvalApproach.FIXED_K else 0
    K = (
        round(len(split_chunks[experiment_name]) * RATIO_K)
        if SEL_APPROACH == EvalApproach.RATIO_K
        else K
    )
    K = (
        200 if SEL_APPROACH == EvalApproach.TOKEN_LIMIT else K
    )  # large number to ensure TOKEN_LIMIT is always reached
    print("Evaluating", experiment_name, "with K =", K if K else "Ground Truth based")
    metrics = []
    for testset in tqdm(questions):
        if testset["ground_truth_chunks"] == {}:
            continue
        question = testset["question"]
        ground_truth = testset["ground_truth_chunks"]
        K = len(ground_truth) if SEL_APPROACH == EvalApproach.GROUND_TRUTH_K else K

        retriever = vector_stores[experiment_name].as_retriever(search_kwargs={"k": K})
        retrieved_chunks = retriever.invoke(question)

        if SEL_APPROACH == EvalApproach.TOKEN_LIMIT:
            # cap the number of retrieved chunks where sum of page_contents are below a fixed context window
            retrieved_chunks_capped = []
            total_context_length = 0
            for chunk in retrieved_chunks:
                total_context_length += len(chunk.page_content)
                if (
                    total_context_length > TOKEN_LIMIT * 4
                ):  # as one token on average is approximately 4 characters
                    break
                retrieved_chunks_capped.append(chunk)

            retrieved_chunks = retrieved_chunks_capped

        retrieved_chunk_ids = [str(doc.metadata["id"]) for doc in retrieved_chunks]
        metrics.append(
            calculate_metrics(
                retrieved_chunk_ids,
                ground_truth_chunks=list(ground_truth.keys()),
                ground_truth_relevancies=[relevancy for relevancy, _ in ground_truth.values()],
                ground_truth_noises=[noise for _, noise in ground_truth.values()],
            )
        )

    mean_metrics = (
        calculate_mean_metrics(metrics)
        if len(metrics)
        else {
            "precision": 0.0,
            "noise": 0.0,
            "recall": 0.0,
            "map": 0.0,
            "ndcg": 0.0,
        }
    )

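    # Names like "fixed_size-2048-0" encode chunk size and overlap;
    # strategies without these suffixes (e.g. "markdown_header") get None.
    try:
        experiment_chunk_size = int(experiment_name.split("-")[-2])
        experiment_chunk_overlap = int(experiment_name.split("-")[-1])
    except (ValueError, IndexError):
        experiment_chunk_size = None
        experiment_chunk_overlap = None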

    results_list.append(
        [
            experiment_name.split("-")[0],
            experiment_chunk_size,
            experiment_chunk_overlap,
            mean_metrics["precision"],
            mean_metrics["noise"] / 5,
            mean_metrics["recall"],
            mean_metrics["map"],
            mean_metrics["ndcg"],
        ]
    )

results = pd.DataFrame(
    results_list,
    columns=[
        eval_name,
        "Chunk Size",
        "Chunk Overlap",
        "Precision",
        "Noise",
        "Recall",
        "MAP",
        "NDCG",
    ],
)
results.to_csv(f"{data_dir}results/{eval_name}.csv", index=False)

Evaluating semantic_chunks_95 with K = 7


100%|██████████| 24/24 [00:08<00:00,  2.79it/s]


Evaluating recursive-2048-0 with K = 7


100%|██████████| 24/24 [00:07<00:00,  3.34it/s]


Evaluating markdown_header_parent with K = 15


100%|██████████| 24/24 [00:06<00:00,  3.50it/s]


Evaluating fixed_size-2048-0 with K = 6


100%|██████████| 24/24 [00:07<00:00,  3.08it/s]
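
The Token-Limit branch of the evaluation loop above can be isolated as a small function. This is a sketch on plain strings rather than LangChain `Document` objects; the helper name `cap_by_token_limit` is hypothetical, and the four-characters-per-token ratio is the same approximation used in the notebook.

```python
from typing import List


def cap_by_token_limit(texts: List[str], token_limit: int, chars_per_token: int = 4) -> List[str]:
    """Keep the top-ranked texts whose cumulative length fits the token budget."""
    kept: List[str] = []
    total_chars = 0
    for text in texts:  # texts are assumed to be ordered by similarity
        total_chars += len(text)
        if total_chars > token_limit * chars_per_token:
            break  # the first text that overflows the budget ends the selection
        kept.append(text)
    return kept


# Budget of 20 tokens (about 80 characters): the third 30-character text no longer fits
capped = cap_by_token_limit(["a" * 30, "b" * 30, "c" * 30], token_limit=20)
```

Note that the loop stops at the first overflowing chunk instead of skipping it, so a single very large chunk near the top of the ranking can leave part of the budget unused.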


### Best Chunking Strategy

In [123]:
results.drop(columns=["MAP"]).groupby([eval_name, "Chunk Size", "Chunk Overlap"], dropna=False).mean().sort_values(by="Recall", ascending=False)

Ratio-K-0.1-text-embedding-3-small,Chunk Size,Chunk Overlap,Precision,Noise,Recall,NDCG
markdown_header_parent,,,0.175,0.911111,1.0,0.928226
semantic_chunks_95,,,0.309524,0.863095,0.994048,0.944802
fixed_size,2048.0,0.0,0.381944,0.829167,0.986111,0.913412
recursive,2048.0,0.0,0.339286,0.840476,0.952083,0.874066


In [13]:
# calculate F1 score
results["F1"] = 2 * (results["Precision"] * results["Recall"]) / (results["Precision"] + results["Recall"])
results["F1"].mean()

0.799677814146692
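
F1 is the harmonic mean of Precision and Recall, so it is dominated by the lower of the two. A short worked example with the hypothetical helper `f1_score`, using the Precision and Recall of the `markdown_header_parent` row from the Ratio-K table above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Perfect recall cannot compensate for low precision
f1_markdown_parent = f1_score(0.175, 1.0)
```

With Precision 0.175 and Recall 1.0 the result is roughly 0.30, which illustrates why a ranking by Recall alone can be misleading when K is large relative to the number of ground truth chunks.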

### Best Chunk Size

In [114]:
results.drop(columns=[eval_name, "Chunk Overlap", "MAP"]).groupby("Chunk Size", dropna=False).mean().sort_values(by="Recall", ascending=False)

Chunk Size,Precision,Noise,Recall,NDCG
,0.768188,5.803704,0.87891,0.937948
512.0,0.75119,5.959524,0.838851,0.888149
2048.0,0.813161,5.455026,0.827984,0.942866


### Best Overlap

In [15]:
results.where(results[eval_name] == "recursive").drop(columns=[eval_name, "Chunk Size", "MAP"]).groupby("Chunk Overlap").mean().sort_values(by="NDCG", ascending=False)

Chunk Overlap,Precision,Recall,NDCG,F1
0.0,0.763958,0.844961,0.922739,0.802406
200.0,0.773269,0.83868,0.918079,0.804549


## Evaluate Generation (TODO)

In [None]:

from langchain import hub  # required for hub.pull below

nest_asyncio.apply()

answer_correctness_system_prompt = """You are a CORRECTNESS grader; providing the correctness of the given GENERATED ANSWER compared to the given GROUND TRUTH ANSWER.
Respond only as a number from 0 to 10 where 0 is the least correct and 10 is the most correct.

A few additional scoring guidelines:

- Long GENERATED ANSWERS should score equally well as short GENERATED ANSWERS.

- CORRECTNESS score should increase as the GENERATED ANSWER matches more accurately with the GROUND TRUTH ANSWER.

- CORRECTNESS score should increase as the GENERATED ANSWER covers more parts of the GROUND TRUTH ANSWER accurately.

- GENERATED ANSWERS that partially match the GROUND TRUTH ANSWER should score 2, 3, or 4. Higher scores indicate more correctness.

- GENERATED ANSWERS that mostly match the GROUND TRUTH ANSWER should get a score of 5, 6, 7, or 8. Higher scores indicate more correctness.

- GENERATED ANSWERS that fully match the GROUND TRUTH ANSWER should get a score of 9 or 10. Higher scores indicate more correctness.

- GENERATED ANSWERS must be fully accurate and comprehensive to the GROUND TRUTH ANSWER to get a score of 10.

- Never elaborate."""

answer_correctness_user_prompt = PromptTemplate.from_template(
    """GROUND TRUTH ANSWER: {ground_truth_answer}

GENERATED ANSWER: {generated_answer}

CORRECTNESS: """
)

prompt = hub.pull("rlm/rag-prompt")
generator_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
        
for experiment_name, questions in datasets.items():
    print("Evaluating", experiment_name)
    vector_stores[experiment_name].embeddings.show_progress_bar = False
    retriever = vector_stores[experiment_name].as_retriever(search_kwargs={"k": 10})
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | generator_llm
        | StrOutputParser()
    )
    for question_type, datasets in questions.items():
        mean_answer_correctness = 0
        for testset in datasets:
            response = rag_chain.invoke(testset["question"])
            answer_correctness_prompt = answer_correctness_user_prompt.format(
                ground_truth_answer=testset["ground_truth_answer"], generated_answer=response
            )

            llm_messages = [
                SystemMessage(content=answer_correctness_system_prompt),
                HumanMessage(content=answer_correctness_prompt),
            ]
            response = make_request_with_backoff(llm_messages)

            answer_correctness = re_0_10_rating(response.content)
            mean_answer_correctness += answer_correctness
        mean_answer_correctness /= len(datasets)
        print(f"Experiment: {experiment_name} Question Type: {question_type} Mean Answer Correctness: {mean_answer_correctness}")