# RAG evaluator
- [x] Metrics -> on Retriever component
- [x] Generator component -> with LLM-based judge
- [x] evaluation dataset
- [x] llm main style evaluation
- [x] RAGAS https://medium.com/data-science/evaluating-rag-applications-with-ragas-81d67b0ee31a
- [ ] Better source data with detailed project information
- [ ] Better golden dataset
- [ ] Dashboard (once the other RAG parts are stable)


In [56]:
import json
from base_models import TestQuestion


In [57]:
import tqdm


TEST_QUESTIONS_FILE = "../evaluation/eval_data.jsonl"

def load_test_questions() -> list[TestQuestion]:
    """
    Load test questions from a JSONL file
    """
    with open(TEST_QUESTIONS_FILE, "r", encoding="utf-8") as f:
        tests = []
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, which would make json.loads fail
                continue
            tests.append(TestQuestion(**json.loads(line)))
        print(f"Loaded {len(tests)} test questions")
    return tests

In [58]:
tests = load_test_questions()

Loaded 50 test questions


In [59]:
tests[0]
print(tests[0].question)
print(tests[0].ground_truth)
print(tests[0].category)
print(tests[0].keywords)

How many teams benefited from Beiji’s n8n rollout and what was the financial impact?
More than 10 teams used the automation workflows, saving about 10,000 SGD per month.
impact
['n8n', '10+ teams', '10k SGD', 'automation']
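
The fields printed above suggest the shape of each record in `eval_data.jsonl`. The real `TestQuestion` class lives in `base_models` and is not shown in this notebook, so the dataclass below is only a hypothetical stand-in (the actual class may well be a Pydantic model). The name `QuestionRecord` is an assumption made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class QuestionRecord:
    """Hypothetical stand-in for base_models.TestQuestion (fields assumed from the printout)."""
    question: str
    ground_truth: str
    category: str
    keywords: list[str] = field(default_factory=list)


# One JSONL line, mirroring the values printed for tests[0].
line = json.dumps({
    "question": "How many teams benefited from Beiji’s n8n rollout and what was the financial impact?",
    "ground_truth": "More than 10 teams used the automation workflows, saving about 10,000 SGD per month.",
    "category": "impact",
    "keywords": ["n8n", "10+ teams", "10k SGD", "automation"],
})

record = QuestionRecord(**json.loads(line))
print(record.category, record.keywords)
```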


In [60]:
from collections import Counter
count = Counter([t.category for t in tests])
count

Counter({'ai_engineering': 10,
         'achievement': 5,
         'engineering': 5,
         'personality': 4,
         'platform_engineering': 3,
         'rag_skill': 3,
         'lifestyle': 3,
         'skills': 3,
         'multi_hop': 3,
         'impact': 2,
         'timeline': 2,
         'project_experience': 2,
         'frontend': 1,
         'education': 1,
         'personal_profile': 1,
         'full_stack_ai': 1,
         'future': 1})

In [62]:
from rag_retrieval import fetch_context

retrieval_results = fetch_context(tests[0].question)

print("Retrieval results:")
print(retrieval_results)



NotFoundError: Error getting collection: Collection [67f909e5-1762-40fe-b66b-d9309eaaf2f9] does not exist.

### Generator Part Evaluation

In [37]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from base_models import RetrievalLLMEval
from rag_retrieval import generate_answer

LLM_EVAL_PROMPT = """
You are a helpful assistant that can answer questions about the user's CV and hobbies.
You are given a question and a context.
You need to evaluate the retrieval results based on the context.
User question:
{question}

Generated answer:
{generated_answer}

Golden answer:
{ground_truth}

Evaluation criteria:
- Accuracy: How many of the retrieval results are correct?
- Relevance: How relevant are the retrieval results to the question?
- Completeness: How complete are the retrieval results?
- Confidence: How confident are you in the retrieval results?
- Score: The average of accuracy, relevance, completeness

Return in the following format:
{{
    "accuracy": 3,
    "relevance": 2,
    "completeness": 4,
    "confidence": 0.9,
    "feedback": "The retrieval results is not relevant to the question but correct",
    "score": 3
}}
"""

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def evaluate_response(test_question: TestQuestion) -> RetrievalLLMEval:
    """
    Score the generated answer against the ground truth with an LLM judge.
    This evaluates the generator's response, not the retrieval results.
    """
    # generate the answer; the retrieved context is returned as well but is not used here
    generated_answer, retrieval_results = generate_answer(test_question.question)

    # parse the messages
    system_messages = [SystemMessage(
        content=("You are an expert evaluator assessing the quality of answers. Evaluate the generated answer by comparing it to the reference answer. Only give 5/5 scores for perfect answers."
                 ))]
    user_messages = [HumanMessage(content=LLM_EVAL_PROMPT.format(question=test_question.question, generated_answer=generated_answer, ground_truth=test_question.ground_truth))]
    messages = system_messages + user_messages

    structured_llm = llm.with_structured_output(RetrievalLLMEval)
    response_LLM_eval = structured_llm.invoke(messages)

    return response_LLM_eval
    

In [38]:
result = evaluate_response(tests[0])

# Check if it's a BaseModel/RetrievalLLMEval instance
print("Type:", type(result))
print("Is RetrievalLLMEval?", isinstance(result, RetrievalLLMEval))
print("\nResult object:")
print(result)
print("\nAccess attributes:")
print(f"accuracy: {result.accuracy}")
print(f"relevance: {result.relevance}")
print(f"completeness: {result.completeness}")
print(f"score: {result.score}")
print(f"confidence: {result.confidence}")
print(f"feedback: {result.feedback}")

Type: <class 'base_models.RetrievalLLMEval'>
Is RetrievalLLMEval? True

Result object:
accuracy=4.0 relevance=3.0 completeness=4.0 confidence=0.9 feedback='The retrieval results are mostly accurate and complete, but the phrasing could be more aligned with the golden answer.' score=3.75

Access attributes:
accuracy: 4.0
relevance: 3.0
completeness: 4.0
score: 3.75
confidence: 0.9
feedback: The retrieval results are mostly accurate and complete, but the phrasing could be more aligned with the golden answer.
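
The prompt defines `score` as the average of accuracy, relevance, and completeness, yet the judge reported 3.75 for scores of 4, 3, and 4. A minimal sketch of recomputing the score in code rather than trusting the LLM's arithmetic follows; `mean_score` is a hypothetical helper, not part of the notebook's modules.

```python
def mean_score(accuracy: float, relevance: float, completeness: float) -> float:
    """Average the three rubric scores, as the prompt defines `score`."""
    return (accuracy + relevance + completeness) / 3


# Values taken from the sample result printed above.
recomputed = mean_score(4.0, 3.0, 4.0)
print(round(recomputed, 2))  # 3.67, whereas the judge reported 3.75
```

One option is to drop `score` from the judge's output and derive it this way, so the aggregate always matches its components.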


In [39]:
def evaluate_LLM(tests: list[TestQuestion]) -> RetrievalLLMEval:
    """
    Run the LLM judge on every test and return the per-field averages.
    """
    results = [evaluate_response(test) for test in tests]

    def mean(field: str) -> float:
        # average one numeric field across all per-test results
        return sum(getattr(result, field) for result in results) / len(results)

    evaluation_result = RetrievalLLMEval(
        accuracy=mean("accuracy"),
        relevance=mean("relevance"),
        completeness=mean("completeness"),
        score=mean("score"),
        confidence=mean("confidence"),
        feedback="This is the average of all the tests",
    )
    return evaluation_result

eval_result_LLM = evaluate_LLM(tests)

In [41]:
print("\nAverage of all the tests from LLM evaluation on LLM answer:")
print(f"accuracy: {eval_result_LLM.accuracy}")
print(f"relevance: {eval_result_LLM.relevance}")
print(f"completeness: {eval_result_LLM.completeness}")
print(f"score: {eval_result_LLM.score}")
print(f"confidence: {eval_result_LLM.confidence}")


Average of all the tests from LLM evaluation on LLM answer:
accuracy: 3.5
relevance: 2.96
completeness: 3.74
score: 3.44
confidence: 0.8320000000000001


### Metric-based evaluation of RAG retrieval results
- MRR
- Keyword Coverage

In [42]:
def evaluate_mrr(keyword: str, retrieval_results: list) -> float:
    """
    Reciprocal rank of the first retrieved document that contains the keyword
    (case-insensitive substring match). Averaging these values gives the MRR.
    rr = 1   -> first result contains the keyword
    rr = 0.5 -> second result contains the keyword
    rr = 0   -> no result contains the keyword
    """
    keyword = keyword.lower()
    for rank, result in enumerate(retrieval_results, start=1):
        if keyword in result.page_content.lower():
            return 1 / rank
    return 0.0
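
As a quick sanity check of the reciprocal-rank logic above, here is a self-contained sketch. `FakeDoc` and its contents are made up for illustration and only mimic the `page_content` attribute that the retrieved documents are assumed to expose; `reciprocal_rank` repeats the same logic under a different name so that the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document with `page_content`."""
    page_content: str


def reciprocal_rank(keyword: str, docs: list[FakeDoc]) -> float:
    """Return 1/rank of the first doc containing the keyword, else 0.0."""
    keyword = keyword.lower()
    for rank, doc in enumerate(docs, start=1):
        if keyword in doc.page_content.lower():
            return 1 / rank
    return 0.0


docs = [
    FakeDoc("Built an in-house developer portal."),
    FakeDoc("Rolled out n8n automation workflows."),
    FakeDoc("n8n saved roughly 10k SGD per month."),
]

print(reciprocal_rank("n8n", docs))         # 0.5 (first hit at rank 2)
print(reciprocal_rank("portal", docs))      # 1.0
print(reciprocal_rank("kubernetes", docs))  # 0.0
```

Because the match is a plain substring test, a keyword such as `10+ teams` scores 0 whenever the document phrases it differently (for example "more than 10 teams"), which can lower both MRR and keyword coverage.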

In [44]:
from base_models import RetrievalEval


def evaluate_retrieval(test: TestQuestion) -> RetrievalEval:
    """
    Evaluate the retrieval results
    """

    retrieved_docs = fetch_context(test.question)
    # the reciprocal rank is computed per keyword, giving one score per keyword
    mrr_scores = [evaluate_mrr(keyword, retrieved_docs) for keyword in test.keywords]
    avg_mrr = sum(mrr_scores) / len(mrr_scores) if mrr_scores else 0.0

    # Calculate keyword coverage
    keywords_found = sum(1 for score in mrr_scores if score > 0)
    total_keywords = len(test.keywords)
    keyword_coverage = (keywords_found / total_keywords * 100) if total_keywords > 0 else 0.0

    return RetrievalEval(
        MRR=avg_mrr,
        keyword_coverage=keyword_coverage,
    )

def evaluate_all(tests: list[TestQuestion]) -> RetrievalEval:
    """
    Evaluate all the tests
    """
    results = []
    for test in tests:
        results.append(evaluate_retrieval(test))
    
    mrr_final = sum(result.MRR for result in results) / len(results)
    keyword_coverage_final = sum(result.keyword_coverage for result in results) / len(results)

    # round to two decimals but keep the values numeric instead of formatting them as strings
    return RetrievalEval(
        MRR=round(mrr_final, 2),
        keyword_coverage=round(keyword_coverage_final, 2),
    )


In [45]:
eval_result_retrieval = evaluate_all(tests)
print("\nAverage of all the tests from retrieval evaluation:")
print(f"MRR: {eval_result_retrieval.MRR}")
print(f"Keyword Coverage: {eval_result_retrieval.keyword_coverage}% ")


Average of all the tests from retrieval evaluation:
MRR: 0.5
Keyword Coverage: 71.27% 
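
A single average can hide weak categories, and the test set is spread over 17 of them. A minimal sketch of a per-category breakdown follows; `mean_by_category` is a hypothetical helper and the sample values are made up, not results from this notebook.

```python
from collections import defaultdict


def mean_by_category(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Average a metric per category from (category, value) pairs."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for category, value in rows:
        buckets[category].append(value)
    return {category: sum(values) / len(values) for category, values in buckets.items()}


# Made-up values purely for illustration.
rows = [("impact", 1.0), ("impact", 0.5), ("timeline", 0.25)]
print(mean_by_category(rows))  # {'impact': 0.75, 'timeline': 0.25}
```

In this notebook the pairs could be built with something like `[(t.category, evaluate_retrieval(t).MRR) for t in tests]`.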


### RAGAS
https://medium.com/data-science/evaluating-rag-applications-with-ragas-81d67b0ee31a

In [46]:
from datasets import Dataset

questions = [test.question for test in tests]
ground_truths = [test.ground_truth for test in tests]
contexts = []
answers = []

# TODO: reuse the context and generated answer
for test in tests:
    test_answer, test_context = generate_answer(test.question)
    answers.append(test_answer)
    # each query has multiple context documents
    contexts.append([doc.page_content for doc in test_context])

dataset = Dataset.from_dict({
    "question": questions,
    "reference": ground_truths,
    "contexts": contexts,
    "answer": answers,
})
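
One way to address the TODO above is to cache generation results by question, so the LLM judge and RAGAS share the same answers and contexts instead of calling the pipeline twice. The sketch below assumes a function that maps a question to an `(answer, context)` tuple; `cached` and `fake_generate_answer` are hypothetical names, and the fake function merely stands in for `generate_answer`.

```python
from typing import Callable


def cached(fn: Callable[[str], tuple]) -> Callable[[str], tuple]:
    """Wrap a question -> (answer, context) function with a dict cache."""
    store: dict[str, tuple] = {}

    def wrapper(question: str) -> tuple:
        if question not in store:
            store[question] = fn(question)
        return store[question]

    return wrapper


calls = []


def fake_generate_answer(question: str) -> tuple:
    """Hypothetical stand-in for rag_retrieval.generate_answer."""
    calls.append(question)
    return f"answer to {question}", ["context chunk"]


generate_cached = cached(fake_generate_answer)
first = generate_cached("q1")
second = generate_cached("q1")  # served from the cache, no second call
print(len(calls))  # 1
```

Caching also keeps the two evaluations comparable, since both score exactly the same generated answer.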

In [47]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

from rag_ingestion import embeddings

# Note: RAGAS logs "LLM returned 1 generations instead of requested 3.
# Proceeding with 1 generations." because it requests 3 generations by default
# while this LLM returns only 1. The evaluation still works with 1 generation.

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    embeddings=embeddings,
    llm=llm,
)

df = result.to_pandas()



Evaluating:   0%|          | 0/200 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

In [48]:
df.head()

| | user_input | retrieved_contexts | response | reference | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|---|
| 0 | How many teams benefited from Beiji’s n8n roll... | [Software Engineer Aug 2023 - Present United O... | Beiji's n8n rollout benefited over 10 teams an... | More than 10 teams used the automation workflo... | 1.0 | 1.0 | 1.0 | 0.760898 |
| 1 | Which two internal platforms did Beiji build t... | [for 200+ templates. • Built in-house Develope... | Beiji built two internal platforms to improve ... | He built an in-house Developer Portal and work... | 1.0 | 1.0 | 1.0 | 0.943257 |
| 2 | What access control problem did the Prompt Tem... | [for 200+ templates. • Built in-house Develope... | The Prompt Template Hub solved the access cont... | It provided version control and access control... | 0.416667 | 1.0 | 0.6 | 0.903517 |
| 3 | Which project of Beiji’s is being scaled by an... | [What truly sets Beiji apart is his innate dri... | The project being scaled by an enterprise GenA... | The LLM-Based Bitbucket Code Reviewer is being... | 0.5 | 0.0 | 0.714286 | 0.717903 |
| 4 | What retrieval techniques improved the Best Pr... | [LoRA and AI Agents \| Udemy • Fine-tuned Llama... | The retrieval techniques that improved the Bes... | Query rewriting, reranking, and optimized embe... | 0.75 | 1.0 | 0.714286 | 0.992616 |


In [49]:
print("context precision: ", format(df["context_precision"].mean(), ".2f"))
print("context recall: ", format(df["context_recall"].mean(), ".2f"))
print("faithfulness: ", format(df["faithfulness"].mean(), ".2f"))
print("answer relevancy: ", format(df["answer_relevancy"].mean(), ".2f"))

context precision:  0.50
context recall:  0.69
faithfulness:  0.83
answer relevancy:  0.78
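
To read the context precision numbers, it helps to see how the metric rewards relevant chunks that appear early in the ranking. The sketch below follows the definition as I understand it from the RAGAS documentation: the mean of precision@k over the ranks k that hold a relevant chunk. The function is an illustrative reimplementation, not RAGAS code, and the 0/1 relevance lists are assumed inputs (RAGAS obtains these verdicts from the LLM).

```python
def context_precision(relevant: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, flag in enumerate(relevant, start=1):
        if flag:
            hits += 1
            total += hits / k  # precision@k at a relevant position
    return total / hits if hits else 0.0


print(round(context_precision([1, 1, 0, 0]), 4))  # 1.0
print(round(context_precision([0, 0, 1, 1]), 4))  # 0.4167
```

Both lists contain two relevant chunks out of four, but ranking them last drops the score to about 0.42. A pattern like the second one would be consistent with the 0.416667 seen in row 2 above, although the actual per-chunk verdicts are not shown in this notebook. Under this reading, a mean of 0.50 points at ranking quality (for example, adding a reranker) more than at missing content, given the higher context recall of 0.69.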
