**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Retriever Evaluation Metrics

## Overview
- This notebook demonstrates how to evaluate RAG system retrievers using DeepEval metrics
- It focuses on three key evaluation metrics for assessing retrieval quality in RAG pipelines

## Key Evaluation Metrics

### **Contextual Precision**
- **Purpose**: Measures whether relevant document chunks are ranked higher than irrelevant ones
- **Input Requirements**: Query, actual output, expected output, and retrieval context
- **Scoring**: Evaluates ranking quality of retrieved documents (0.0 to 1.0)
- **Use Case**: Assesses how well the retriever prioritizes relevant information

### **Contextual Recall**
- **Purpose**: Measures how well the retrieval context aligns with expected output
- **Input Requirements**: Query, actual output, expected output, and retrieval context
- **Scoring**: Fraction of the expected output's information that the retrieved documents cover (0.0 to 1.0)
- **Use Case**: Determines if retriever captures all necessary information

### **Contextual Relevancy**
- **Purpose**: Measures overall relevance of retrieved context for a given query
- **Input Requirements**: Query, actual output, and retrieval context
- **Scoring**: Share of the retrieved information that is relevant to the query (0.0 to 1.0)
- **Use Case**: Assesses overall quality of retrieved content

## Technical Implementation

- **DeepEval Framework**: Uses DeepEval's LLM-based evaluation metrics
- **LLM Judge**: GPT-4o model evaluates relevance and provides reasoning
- **Test Cases**: Creates LLMTestCase objects for systematic evaluation
- **Thresholds**: Configurable success thresholds (default: 0.5)
- **Verbose Mode**: Provides detailed reasoning for metric scores
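
The threshold setting can be illustrated with a minimal sketch. It assumes a metric passes when its score is at or above the threshold; the helper `passes` is hypothetical and is not part of DeepEval.

```python
def passes(score, threshold=0.5):
    """Assumed pass rule: a score at or above the threshold counts as success."""
    return score >= threshold

# Made-up scores, used only to show the effect of the default threshold of 0.5.
outcomes = {score: passes(score) for score in (0.2, 0.5, 0.9)}
print(outcomes)
```

Raising the threshold makes the check stricter without changing how the score itself is computed.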

## Evaluation Process

1. **Setup**: Run existing RAG pipeline to get retrieval results
2. **Context Preparation**: Extract and format retrieved documents
3. **Metric Configuration**: Set up evaluation parameters and thresholds
4. **Testing**: Run evaluation on test cases with different contexts
5. **Analysis**: Review scores, reasons, and pass/fail results
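
The analysis step can be sketched as a small helper that flattens the evaluation result into rows. The attribute names (`test_results`, `metrics_data`, `success`, `score`, `reason`) are the ones used in the cells of this notebook; the helper `summarize` and the stand-in object with made-up values are assumptions, added so the sketch runs without calling an LLM.

```python
from types import SimpleNamespace

def summarize(result):
    """Flatten an evaluation result into (case_index, success, score, reason) rows."""
    rows = []
    for index, test_result in enumerate(result.test_results):
        for metric_data in test_result.metrics_data:
            rows.append((index, metric_data.success, metric_data.score, metric_data.reason))
    return rows

# Stand-in for the object returned by evaluate(); all values are placeholders.
fake_result = SimpleNamespace(test_results=[
    SimpleNamespace(metrics_data=[SimpleNamespace(success=True, score=0.8, reason="placeholder")]),
    SimpleNamespace(metrics_data=[SimpleNamespace(success=False, score=0.3, reason="placeholder")]),
])

rows = summarize(fake_result)
for row in rows:
    print(row)
```

With a real result, the same loop replaces the repeated `print` cells used for each test case.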

## Benefits

- **Quality Assurance**: Systematic evaluation of retrieval performance
- **Debugging**: Identifies issues with document ranking and relevance
- **Optimization**: Provides metrics to improve retriever performance
- **Transparency**: Clear reasoning for evaluation scores

In [None]:
%run Build_RAG_Pipeline_with_Source.ipynb

# Retriever Evaluation Metrics

![](https://i.imgur.com/5S4FhMB.png)

The retrieval process generally includes these steps:

- Convert the initial input query into an embedding using an embedding model of your choice (e.g., OpenAI's `text-embedding-3-small` model).
- Conduct a vector search with the embedded input on a vector database that holds your vectorized knowledge base, retrieving the top-K most "similar" document chunks.
- Optionally use a reranker to rerank the retrieved results.
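
The vector search step can be sketched in plain Python. This is a toy illustration, not the pipeline used in this notebook: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts most similar to the query, best match first."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical 3-dimensional "embeddings"; a real model returns much longer vectors.
toy_index = [
    ("AI is known as Artificial Intelligence", [0.9, 0.1, 0.0]),
    ("NVIDIA makes chips for AI", [0.5, 0.5, 0.1]),
    ("Bread is baked in an oven", [0.0, 0.1, 0.9]),
]
toy_query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded query "What is AI?"

top_chunks = top_k(toy_query_vec, toy_index, k=2)
print(top_chunks)
```

A vector database does the same ranking at scale, usually with an approximate nearest-neighbour index instead of a full sort.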


Key Metrics to Evaluate here include:

- Contextual Precision
- Contextual Recall
- Contextual Relevancy

## Contextual Precision

The contextual precision metric measures your RAG pipeline's retriever by evaluating whether document chunks (nodes) in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones.

`deepeval`'s contextual precision metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a judge.

In `deepeval`, to use the ContextualPrecisionMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query
- `actual_output` : Actual LLM Response (not used in the computation)
- `expected_output` : Expected LLM Response (ground truth answer)
- `retrieval_context` : Top-N retrieved document chunks (nodes) from Vector DB


![](https://i.imgur.com/oVwrRAU.png)
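
To make the ranking idea concrete, here is a minimal sketch of the score. It assumes the weighted cumulative precision formula given in DeepEval's documentation (the average of precision@k over the ranks that hold a relevant chunk) and uses hand-written 0/1 relevance verdicts in place of the LLM judge. The function name is hypothetical.

```python
def contextual_precision(relevance_flags):
    """Weighted cumulative precision over a ranked list of 0/1 relevance verdicts."""
    relevant_seen = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@k, counted only at ranks that hold a relevant chunk
            weighted_sum += relevant_seen / k
    return weighted_sum / relevant_seen if relevant_seen else 0.0

well_ranked = contextual_precision([1, 1, 0, 0])    # relevant chunks ranked first
poorly_ranked = contextual_precision([0, 0, 1, 1])  # relevant chunks ranked last
print(well_ranked, poorly_ranked)
```

Both lists contain the same two relevant chunks; only their position changes, which is why the second score is lower.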





In [None]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

### Example:

In [None]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

In [None]:
human_answer = """AI, also known as Artificial Intelligence is used to build complex systems for applications
                  like virtual assistants, robotics and autonomous vehicles."""

In [None]:
new_context = ['Machine Learning is the study of algorithms which learn with more data',
               'AI is known as Artificial Intelligence'] + retrieved_context
new_context

In [None]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    expected_output=human_answer,
    retrieval_context=new_context
)

metric = ContextualPrecisionMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

In [None]:
print(result)

In [None]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

## Contextual Recall

The contextual recall metric measures the quality of your RAG pipeline's retriever by evaluating the extent to which the `retrieval_context` aligns with the `expected_output`.

`deepeval`'s contextual recall metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a Judge.

In `deepeval`, to use the ContextualRecallMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query (not used in the computation)
- `actual_output` : Actual LLM Response (not used in the computation)
- `expected_output` : Expected LLM Response (ground truth answer)
- `retrieval_context` : Top-N retrieved document chunks (nodes) from Vector DB


![](https://i.imgur.com/PDbwuX5.png)
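
As a rough sketch of the computation, assume the score is the fraction of statements in the `expected_output` that can be attributed to some chunk in the `retrieval_context`. The verdicts below are hand-written placeholders for what the LLM judge would decide, and the function name is hypothetical.

```python
def contextual_recall(attributable_flags):
    """Fraction of expected-output statements supported by the retrieval context."""
    if not attributable_flags:
        return 0.0
    return sum(attributable_flags) / len(attributable_flags)

# Hypothetical verdicts for two statements drawn from the expected answer:
#   1. "AI is also known as Artificial Intelligence"
#   2. "AI is used to build virtual assistants, robotics and autonomous vehicles"
full_coverage = contextual_recall([True, True])      # context supports both statements
partial_coverage = contextual_recall([True, False])  # context supports only the first
print(full_coverage, partial_coverage)
```

A low score points at missing information in the retrieved chunks, regardless of how they are ranked.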





In [None]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

### Example 1:

In [None]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

In [None]:
human_answer = """AI, also known as Artificial Intelligence is used to build complex systems for applications
                  like virtual assistants, robotics and autonomous vehicles."""

In [None]:
new_context = ['NVIDIA makes chips for AI', 'AI is an acronym for Artificial Intelligence']
new_context

In [None]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRecallMetric
from deepeval import evaluate

test_case1 = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

test_case2 = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    expected_output=human_answer,
    retrieval_context=new_context
)

metric = ContextualRecallMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case1, test_case2], [metric])

In [None]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

In [None]:
print('Success:', result.test_results[1].metrics_data[0].success)
print('Score:', result.test_results[1].metrics_data[0].score)
print('Reason:', result.test_results[1].metrics_data[0].reason)

## Contextual Relevancy

The contextual relevancy metric measures the quality of your RAG pipeline's retriever by evaluating the overall relevance of the information presented in your `retrieval_context` for a given `input`.

`deepeval`'s contextual relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a Judge.

In `deepeval`, to use the ContextualRelevancyMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query
- `actual_output` : Actual LLM Response (not used in the computation)
- `retrieval_context` : Top-N retrieved document chunks (nodes) from Vector DB


![](https://i.imgur.com/VLKoEsI.png)
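
A minimal sketch of the idea, assuming the score is the share of retrieved statements that are relevant to the `input`. The verdicts and the number of chunks are made up for illustration, and the function name is hypothetical.

```python
def contextual_relevancy(relevant_flags):
    """Share of retrieved statements judged relevant to the query."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)

# Hypothetical verdicts: three on-topic retrieved chunks...
retrieved_only = [True, True, True]
# ...and the same chunks with two off-topic ones prepended.
with_noise = [False, False] + retrieved_only

clean_score = contextual_relevancy(retrieved_only)
noisy_score = contextual_relevancy(with_noise)
print(clean_score, noisy_score)
```

Unlike contextual precision, the position of the off-topic chunks does not matter here; only their proportion does.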





In [None]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

### Example:

In [None]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

In [None]:
new_context = ['NVIDIA makes chips for AI', 'Google and Microsoft are battling out the market share for AI Chatbots'] + retrieved_context
new_context

In [None]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    expected_output=human_answer,  # not required by this metric; reuses human_answer from the earlier cells
    retrieval_context=new_context
)

metric = ContextualRelevancyMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

In [None]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)