# RAG Evaluation Notebook

This notebook evaluates the performance of a Retrieval-Augmented Generation (RAG) system. It loads test cases and measures:
- **Retrieval quality**: How well the system retrieves relevant documents
- **Answer quality**: How accurate, complete, and relevant the generated answers are

The evaluation framework uses predefined test cases with reference answers and expected keywords.

In [None]:
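# Add the project root (two levels above the working directory) to sys.path
# so that the project's modules can be imported
import os
import sys

project_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
if project_root not in sys.path:
    sys.path.append(project_root)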

In [None]:
# Import the test loader to access the evaluation test cases
from interface_v1.evaluations.test_loader import load_tests

## Step 1: Load Test Cases

Load the test dataset containing questions, reference answers, categories, and keywords.

In [None]:
# Load all test cases from the test dataset
tests = load_tests()

In [None]:
# Display the total number of test cases
len(tests)

## Step 2: Inspect Test Data

Each test case contains:
- **question**: The user query
- **category**: The topic/domain (e.g., insurance product, company info)
- **reference_answer**: The ground truth answer
- **keywords**: Expected terms that should appear in correct answers
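
The class that `load_tests()` actually returns is not shown in this notebook. As a rough sketch only, a test case with these four fields could be modeled as a dataclass. The name `TestCaseSketch`, the field types, and the sample values below are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TestCaseSketch:
    """Hypothetical stand-in for the objects returned by load_tests()."""
    question: str
    category: str
    reference_answer: str
    keywords: List[str] = field(default_factory=list)


# Made-up values, for illustration only
sample = TestCaseSketch(
    question="Example question?",
    category="example_category",
    reference_answer="Example reference answer.",
    keywords=["example", "keyword"],
)
print(sample.category)
```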

In [None]:
# Examine the structure of a test case
example = tests[0]
print(f"Question: {example.question}")
print(f"Category: {example.category}")
print(f"Reference Answer: {example.reference_answer}")
print(f"Keywords: {example.keywords}")

## Step 3: Analyze Test Distribution

Check how many test cases exist for each category.

In [None]:
# Count the number of test cases in each category
from collections import Counter

category_counts = Counter(t.category for t in tests)
category_counts

## Step 4: Evaluate RAG System Performance

Use evaluation functions to measure retrieval quality and answer generation quality.
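
The internals of `evaluate_retrieval` are not shown here. One simple way to score retrieval against a test case's expected keywords is keyword coverage: the fraction of keywords that appear in at least one retrieved chunk. The sketch below illustrates that idea only. The function `keyword_coverage` and the sample chunks are hypothetical and may differ from what the project does.

```python
def keyword_coverage(keywords, chunks):
    """Return the fraction of expected keywords found in any retrieved chunk.

    Hypothetical helper: uses case-insensitive substring matching.
    """
    if not keywords:
        return 0.0
    text = " ".join(chunks).lower()
    found = [kw for kw in keywords if kw.lower() in text]
    return len(found) / len(keywords)


# Made-up chunks and keywords, for illustration only
chunks = ["The plan covers dental care.", "Premiums are billed monthly."]
score = keyword_coverage(["dental", "monthly", "deductible"], chunks)
print(f"Keyword coverage: {score:.2f}")
```

Substring matching is deliberately crude: it ignores synonyms and word boundaries, so a score like this is best read as a quick sanity check rather than a full retrieval metric.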

In [None]:
# Import evaluation functions for retrieval and answer quality assessment
from interface_v1.evaluations.evaluations import evaluate_retrieval, evaluate_answer

In [None]:
# Evaluate how well the system retrieves relevant documents for the example query
evaluate_retrieval(example)

In [None]:
# Evaluate the quality of the generated answer, including accuracy, completeness, and relevance.
# The result is named `evaluation` so that it does not shadow Python's built-in eval().
evaluation, answer, chunks = evaluate_answer(example)

In [None]:
# Display the generated answer and its evaluation
print(answer)
print(evaluation)

## Evaluation Metrics

The evaluation provides the following metrics:
- **Accuracy**: How factually correct the answer is (0-1 scale)
- **Completeness**: How thoroughly the answer addresses the question (0-1 scale)
- **Relevance**: How relevant the answer is to the question (0-1 scale)
- **Feedback**: Detailed comments on strengths and areas for improvement

In [None]:
# Display detailed evaluation metrics
print(f"Feedback: {evaluation.feedback}")
print(f"Accuracy: {evaluation.accuracy}")
print(f"Completeness: {evaluation.completeness}")
print(f"Relevance: {evaluation.relevance}")
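
A single example says little about overall performance, so the three numeric metrics are usually averaged over many test cases. The following is a minimal sketch of that aggregation. `ScoreSketch` is a hypothetical stand-in for the evaluation object, `average_scores` is an illustrative helper, and the scores are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoreSketch:
    """Hypothetical stand-in exposing the three numeric metrics."""
    accuracy: float
    completeness: float
    relevance: float


def average_scores(results):
    """Average each metric over a non-empty list of score objects."""
    return {
        name: mean(getattr(r, name) for r in results)
        for name in ("accuracy", "completeness", "relevance")
    }


# Made-up scores, for illustration only
results = [ScoreSketch(1.0, 0.5, 1.0), ScoreSketch(0.5, 0.5, 1.0)]
summary = average_scores(results)
print(summary)
```

In the notebook, `results` could instead be built by calling `evaluate_answer` on each test case and collecting the first returned value, assuming those objects expose the same three attributes as numbers.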