**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Generator Evaluation Metrics

## Overview
- This notebook demonstrates how to evaluate RAG system generators using DeepEval metrics
- It focuses on four key evaluation metrics for assessing generation quality in RAG pipelines

## Key Evaluation Metrics

### **Answer Relevancy**
- **Purpose**: Measures how relevant the generated answer is to the input query
- **Implementation Options**:
  - **DeepEval**: LLM-based evaluation using GPT-4o as judge
  - **RAGAS**: Semantic similarity-based evaluation using cosine similarity
- **Input Requirements**: Query and generated response
- **Use Case**: Assesses if the answer directly addresses the user's question

### **Faithfulness**
- **Purpose**: Measures whether the generated output factually aligns with retrieved context
- **Input Requirements**: Query, generated response, and retrieval context
- **Scoring**: Rates how consistent the claims in the response are with the retrieved documents; higher is better
- **Use Case**: Ensures generated answers are grounded in retrieved information

### **Hallucination Check**
- **Purpose**: Determines if the LLM generates factually correct information
- **Input Requirements**: Query, generated response, and human ground truth context
- **Scoring**: Rates how much the response contradicts the verified context; lower is better, so the threshold acts as an upper bound
- **Use Case**: Detects when AI generates false or unsupported claims

### **Custom LLM as Judge (G-Eval)**
- **Purpose**: Uses advanced LLM models to evaluate response quality
- **Implementation**: Leverages GPT-4o for sophisticated evaluation reasoning
- **Benefits**: Provides detailed explanations for metric scores

## Technical Implementation

- **DeepEval Framework**: Comprehensive evaluation suite for RAG systems
- **LLM Judge**: GPT-4o model evaluates quality and provides reasoning
- **Test Cases**: Creates LLMTestCase objects for systematic evaluation
- **Configurable Thresholds**: Adjustable success criteria for each metric
- **Verbose Mode**: Detailed reasoning and evaluation breakdowns

## Evaluation Process

1. **Setup**: Run existing RAG pipeline to get generation results
2. **Context Preparation**: Extract retrieved documents and ground truth
3. **Metric Configuration**: Set up evaluation parameters and thresholds
4. **Testing**: Run evaluation on test cases with different scenarios
5. **Analysis**: Review scores, reasons, and pass/fail results
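Steps 1 and 2 can be sketched as a small helper that pulls the evaluation inputs out of a pipeline response. This is a minimal sketch, not the notebook's own code: `Doc` is a hypothetical stand-in for the retrieved document objects, `prepare_eval_inputs` is an illustrative name, and the response is assumed to be a dict with `question`, `response`, and `context` keys, as in the cells of this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document object."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def prepare_eval_inputs(response, ground_truth=None):
    """Collect the fields that the metrics in this notebook read."""
    return {
        "input": response["question"],
        "actual_output": response["response"],
        # Plain text of each retrieved chunk, in retrieval order
        "retrieval_context": [doc.page_content for doc in response["context"]],
        # Optional human-verified context, used by the hallucination check
        "context": ground_truth,
    }


# Toy response shaped like the pipeline output (assumed structure)
response = {
    "question": "What is AI?",
    "response": "AI refers to machines mimicking human intelligence.",
    "context": [Doc("Artificial intelligence refers to machines mimicking human intelligence.")],
}
inputs = prepare_eval_inputs(response)
print(inputs["retrieval_context"])
```

The keys are chosen to mirror the `LLMTestCase` arguments used later in the notebook, so the dict is intended to be passed on as keyword arguments.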

## Benefits

- **Quality Assurance**: Systematic evaluation of generation performance
- **Hallucination Detection**: Identifies when AI generates false information
- **Relevance Assessment**: Ensures answers directly address user queries
- **Factual Verification**: Confirms generated content aligns with source material
- **Continuous Improvement**: Provides metrics to optimize generation quality

# Generator Evaluation Metrics

![](https://i.imgur.com/GaMHy7w.png)

The generation step, which comes after retrieval, generally includes:

- Building a prompt that combines the initial input with the context retrieved in the previous step.
- Feeding this prompt to your LLM, which produces the final generated response.


Key Metrics to Evaluate here include:

- Answer Relevancy
- Faithfulness
- Hallucination Check
- Custom LLM as a Judge (G-Eval)

## LLM-based Answer Relevancy - DeepEval

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`.

`deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a Judge.

In `deepeval`, to use the AnswerRelevancyMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query
- `actual_output` : Actual LLM Response


![](https://i.imgur.com/GbNSCFC.png)
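As a rough illustration of how a score can fall out of per-statement verdicts, the sketch below scores an answer as the share of its statements that the judge did not mark irrelevant. The function name, the verdict dictionaries, and the rule that only a `no` verdict counts against the answer are assumptions made for illustration; `deepeval`'s internal prompts and scoring may differ.

```python
def answer_relevancy_score(verdicts):
    """Share of statements that were not judged irrelevant."""
    if not verdicts:
        # Assumption: with nothing to judge, treat the answer as fully relevant
        return 1.0
    relevant = sum(1 for v in verdicts if v["verdict"].strip().lower() != "no")
    return relevant / len(verdicts)


# One statement, judged relevant
single = [{"verdict": "yes", "reason": None}]
# Two statements, one of them off-topic
mixed = [{"verdict": "yes", "reason": None},
         {"verdict": "no", "reason": "Unrelated to the question."}]

print(answer_relevancy_score(single))  # 1.0
print(answer_relevancy_score(mixed))   # 0.5
```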



## Semantic Similarity based Answer Relevancy - RAGAS

DeepEval has bindings to Ragas, which let us use the `RAGASAnswerRelevancyMetric`. This metric assesses how pertinent the generated answer is to the given query using cosine similarity. Answers that are incomplete or contain redundant information receive lower scores; higher scores indicate better relevancy.

![](https://i.imgur.com/vq1ytZ3.png)
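For readers unfamiliar with cosine similarity, here is a minimal sketch of the calculation on toy vectors. The three-dimensional vectors stand in for real embeddings, and the averaging helper reflects an assumption about how several comparisons might be combined into one score; neither is taken from the Ragas implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(query_vec, candidate_vecs):
    """Average similarity of the query vector to each candidate vector."""
    return sum(cosine_similarity(query_vec, v) for v in candidate_vecs) / len(candidate_vecs)


# Toy "embeddings" (made-up numbers, for illustration only)
query_vec = [1.0, 0.0, 0.0]
candidates = [[1.0, 0.0, 0.0],   # same direction as the query
              [0.0, 1.0, 0.0]]   # orthogonal to the query

print(mean_similarity(query_vec, candidates))  # 0.5
```

Vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0, which is why closely related texts end up with scores near the top of the range.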





In [1]:
%run Build_RAG_Pipeline_with_Source.ipynb

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-community 0.3.11 requires langchain<0.4.0,>=0.3.11, but you have langchain 0.3.10 which is incompatible.
Metadata: {'id': 10, 'title': 'Artificial Intelligence'}
Content Brief:


Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.


Metadata: {'id': 10, 'title': 'Artificial Intelligence'}
Content Brief:


Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.


Metadata: {'id': 3, 'title': 'Natural Language Processing (NLP)'}
Content Brief:


NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications range from chatbots to language translation services.


Metadata: {'id': 5, 'title': 'Photosynthesis'}
Content Brief:


Photosynthesis is the process plants use to convert sunlight into energy. This process produces glucose and releases oxygen as a byproduct. It is crucial for sustaining life on Earth by providing food and oxygen.


Metadata: {'id': 5, 'title': 'Photosynthesis'}
Content Brief:


Photosynthesis is the process plants use to convert sunlight into energy. This process produces glucose and releases oxygen as a byproduct. It is crucial for sustaining life on Earth by providing food and oxygen.




In [2]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

{'context': [Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 3, 'title': 'Natural Language Processing (NLP)'}, page_content='NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications rang

### Example - DeepEval:

In [16]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
)

metric = AnswerRelevancyMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:04,  4.31s/test case]

**************************************************
Answer Relevancy Verbose Logs
**************************************************

Statements:
[
    "AI refers to machines mimicking human intelligence, such as problem-solving and learning, and includes applications like virtual assistants, robotics, and autonomous vehicles."
] 
 
Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    }
]
 
Score: 1.0
Reason: The score is 1.00 because the answer is perfectly relevant and directly addresses the question 'What is AI?' without any irrelevant information. Great job!



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the answer is perfectly relevant and directly addresses the question 'What is AI?' without any irrelevant information. Great job!, error: None)

For test case:

  - input: What is AI?
  - actual output: AI refers to machines mimicking human intelligence, such as prob




In [13]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because the response was perfectly relevant with no irrelevant statements. Great job!


### Example - RAGAS:

In [17]:
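from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
# Assumption: OpenAIEmbeddings may not already be in scope from the pipeline notebook
from langchain_openai import OpenAIEmbeddings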

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
)

metric = RAGASAnswerRelevancyMetric(
    threshold=0.5,
    model="gpt-4o",
    embeddings=OpenAIEmbeddings()
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating: 100%|██████████| 1/1 [00:04<00:00,  4.36s/it]
Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:10, 10.53s/test case]



Metrics Summary

  - ✅ Answer Relevancy (ragas) (score: 0.9208820444994655, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: None, error: None)

For test case:

  - input: What is AI?
  - actual output: AI refers to machines mimicking human intelligence, such as problem-solving and learning, and includes applications like virtual assistants, robotics, and autonomous vehicles.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Answer Relevancy (ragas): 100.00% pass rate







In [18]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.9208820444994655
Reason: None


## Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`.

`deepeval`'s faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a Judge.

In `deepeval`, to use the FaithfulnessMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query (not used in the computation)
- `actual_output` : Actual LLM Response
- `retrieval_context` : Top-N retrieved document chunks (nodes) from Vector DB


![](https://i.imgur.com/OCSFPTb.png)
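One way to picture the score: each claim extracted from the `actual_output` is checked against the `retrieval_context`, and the score is the share of claims that the context does not contradict. The sketch below assumes (it is not taken from `deepeval`'s source) that only a `no` verdict marks a contradiction, so an undecided `idk` verdict does not lower the score. The function name and inputs are illustrative.

```python
def faithfulness_score(claims, verdicts):
    """Share of claims that the retrieval context does not contradict."""
    if len(claims) != len(verdicts):
        raise ValueError("expected one verdict per claim")
    if not claims:
        # Assumption: an output with no claims cannot be unfaithful
        return 1.0
    contradicted = [c for c, v in zip(claims, verdicts) if v.strip().lower() == "no"]
    return 1 - len(contradicted) / len(claims)


claims = [
    "AI refers to machines mimicking human intelligence.",
    "AI includes applications like virtual assistants.",
]

print(faithfulness_score(claims, ["idk", "yes"]))  # 1.0
print(faithfulness_score(claims, ["no", "yes"]))   # 0.5
```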





In [19]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

{'context': [Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 3, 'title': 'Natural Language Processing (NLP)'}, page_content='NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications rang

### Example:

In [20]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

["Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
 "Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
 'NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications range from chatbots to language translation services.']

In [21]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    retrieval_context=retrieved_context
)

metric = FaithfulnessMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...



Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:08,  8.24s/test case]

**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "Artificial intelligence refers to machines mimicking human intelligence.",
    "AI includes applications like virtual assistants, robotics, and autonomous vehicles.",
    "AI is evolving rapidly with advancements in machine learning and deep learning.",
    "NLP is a branch of AI that enables computers to understand, interpret, and generate human language.",
    "NLP techniques include tokenization, stemming, and sentiment analysis.",
    "Applications of NLP range from chatbots to language translation services."
] 
 
Claims:
[
    "AI refers to machines mimicking human intelligence, such as problem-solving and learning.",
    "AI includes applications like virtual assistants, robotics, and autonomous vehicles."
] 
 
Verdicts:
[
    {
        "verdict": "idk",
        "reason": null
    },
    {
        "verdict": "yes",
        "r




In [22]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because there are no contradictions, indicating a perfect alignment between the actual output and the retrieval context. Great job maintaining consistency!


## Hallucination Check

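The hallucination metric determines whether your LLM generates factually correct information by comparing the `actual_output` to the provided (human ground truth) `context`. Unlike the metrics above, a lower score is better here: the score reflects how much of the `context` the output contradicts, so `threshold` acts as a maximum and a test case passes when its score does not exceed it.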

`deepeval`'s hallucination metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score using an LLM as a Judge.

In `deepeval`, to use the HallucinationMetric, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input` : Input Query (not used in the computation)
- `actual_output` : Actual LLM Response
- `context` : Human Ground Truth Context Document Chunks (Nodes)


![](https://i.imgur.com/qyVBKU2.png)
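The direction of this metric is easy to misread, so here is a minimal sketch of it under stated assumptions: one verdict per `context` item, a `no` verdict meaning the output contradicts that item, a score equal to the contradicted share, and a pass when the score stays at or below the threshold. The helper names are hypothetical and the real `HallucinationMetric` may compute details differently.

```python
def hallucination_score(verdicts):
    """Share of context items that the output contradicts (lower is better)."""
    if not verdicts:
        # Assumption: with no context to compare against, report no hallucination
        return 0.0
    contradicted = sum(1 for v in verdicts if v.strip().lower() == "no")
    return contradicted / len(verdicts)


def passes(score, threshold=0.5):
    """The threshold is a maximum: the test passes at or below it."""
    return score <= threshold


grounded = hallucination_score(["yes", "yes"])
made_up = hallucination_score(["no", "no"])

print(grounded, passes(grounded))  # 0.0 True
print(made_up, passes(made_up))    # 1.0 False
```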





In [23]:
query = "What is AI?"
response = rag_chain_w_sources.invoke(query)
response

{'context': [Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 10, 'title': 'Artificial Intelligence'}, page_content="Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning."),
  Document(metadata={'id': 3, 'title': 'Natural Language Processing (NLP)'}, page_content='NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications rang

### Example:

In [24]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

["Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
 "Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
 'NLP is a branch of AI that enables computers to understand, interpret, and generate human language. Techniques include tokenization, stemming, and sentiment analysis. Applications range from chatbots to language translation services.']

In [25]:
human_ground_truth_context = ["Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
                              "Machine learning is a field of artificial intelligence focused on enabling systems to learn patterns from data. Algorithms analyze past data to make predictions or classify information. Popular applications include recommendation systems and image recognition."]
human_ground_truth_context

["Artificial intelligence refers to machines mimicking human intelligence, like problem-solving and learning. AI includes applications like virtual assistants, robotics, and autonomous vehicles. It's evolving rapidly with advancements in machine learning and deep learning.",
 'Machine learning is a field of artificial intelligence focused on enabling systems to learn patterns from data. Algorithms analyze past data to make predictions or classify information. Popular applications include recommendation systems and image recognition.']

In [30]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['response'],
    context=human_ground_truth_context
)

metric = HallucinationMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.47s/test case]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The actual output agrees with the provided context which states that AI refers to machines mimicking human intelligence, like problem-solving and learning, and includes applications like virtual assistants, robotics, and autonomous vehicles. It also mentions the rapid evolution of AI with advancements in machine learning and deep learning."
    },
    {
        "verdict": "yes",
        "reason": "The actual output does not contradict the provided context about machine learning being a field of artificial intelligence focused on enabling systems to learn patterns from data. The detail about machine learning and its applications does not conflict with the general description of AI in the actual output."
    }
]
 
Score: 0.0
Reason: The score is 0.00 because the actual output fully aligns with the 




In [31]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.0
Reason: The score is 0.00 because the actual output fully aligns with the provided context and there are no contradictions.


In [32]:
ai_response = 'AI refers to machines mimicking human intelligence to produce cyborgs and electric sheep'

In [33]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=ai_response,
    context=human_ground_truth_context
)

metric = HallucinationMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:04,  4.68s/test case]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The actual output does not agree with the provided context. The context states that AI includes applications like virtual assistants, robotics, and autonomous vehicles, not the production of cyborgs and electric sheep."
    },
    {
        "verdict": "no",
        "reason": "The actual output contradicts the context which defines machine learning as enabling systems to learn patterns from data and does not mention producing cyborgs and electric sheep."
    }
]
 
Score: 1.0
Reason: The score is 1.00 because there are multiple contradictions between the actual output and the provided context, with no factual alignments present. The output incorrectly associates AI and machine learning with the production of cyborgs and electric sheep, which is not supported by the context.



Metrics Summary

  - ❌




In [34]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 1.0
Reason: The score is 1.00 because there are multiple contradictions between the actual output and the provided context, with no factual alignments present. The output incorrectly associates AI and machine learning with the production of cyborgs and electric sheep, which is not supported by the context.
