# Deep Evaluation of RAG Systems using deepeval

## Overview

This notebook uses the `deepeval` library to evaluate Retrieval-Augmented Generation (RAG) systems. It covers three metrics (correctness, faithfulness, and contextual relevancy), shows how to run several metrics over several test cases in one call, and provides a helper for building test cases in bulk.

## Key Components

1. Correctness Evaluation
2. Faithfulness Evaluation
3. Contextual Relevancy Evaluation
4. Combined Evaluation of Multiple Metrics
5. Batch Test Case Creation

## Evaluation Metrics

### 1. Correctness (GEval)

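- Evaluates whether the actual output is factually correct with respect to the expected output.
- Uses `gpt-4o` as the evaluation model in this notebook.
- Is configured with a single natural-language evaluation step; only the expected and actual outputs are passed to the judge.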

### 2. Faithfulness (FaithfulnessMetric)

- Assesses whether the generated answer is faithful to the provided context.
- Uses GPT-4 as the evaluation model.
- Can provide detailed reasons for the evaluation.
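
As a rough mental model, a faithfulness score can be read as the fraction of claims in the answer that the retrieval context supports. The sketch below assumes that reading. `faithfulness_ratio` and the boolean verdict list are hypothetical stand-ins for illustration, not `deepeval` internals; the library obtains claims and verdicts by prompting the evaluation model.

```python
def faithfulness_ratio(claim_supported):
    """Fraction of claims judged consistent with the context.

    claim_supported: list of booleans, one per claim extracted from the
    answer (hypothetical input; an LLM judge would produce these).
    """
    if not claim_supported:
        # Assumption: an answer with no checkable claims counts as fully faithful.
        return 1.0
    return sum(claim_supported) / len(claim_supported)


# Three of four claims are supported by the context.
print(faithfulness_ratio([True, True, False, True]))
```

Under this assumption, a single unsupported claim in a short answer lowers the score sharply, which is why the threshold should be chosen with typical answer length in mind.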

### 3. Contextual Relevancy (ContextualRelevancyMetric)

- Evaluates how relevant the retrieved context is to the question (the test case `input`).
- Uses GPT-4 as the evaluation model.
- Can provide detailed reasons for the evaluation.

## Key Features

1. Flexible Metric Configuration: Each metric can be customized with different models and parameters.
2. Multi-Metric Evaluation: Ability to evaluate test cases using multiple metrics simultaneously.
3. Batch Test Case Creation: Utility function to create multiple test cases efficiently.
4. Detailed Feedback: Options to include detailed reasons for evaluation results.
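
To make multi-metric evaluation concrete, here is a minimal sketch of how per-metric results might combine into a single verdict for one test case. It assumes a metric passes when its score is at least its threshold, and that a test case passes only when every metric passes. `MetricOutcome` and `all_metrics_passed` are hypothetical names, and the scores are illustrative values rather than library output.

```python
from dataclasses import dataclass


@dataclass
class MetricOutcome:
    """Hypothetical record of one metric's result on one test case."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: pass when the score meets or exceeds the threshold.
        return self.score >= self.threshold


def all_metrics_passed(outcomes):
    """Assumed aggregation rule: every metric must pass."""
    return all(o.passed for o in outcomes)


# Illustrative scores and thresholds.
outcomes = [
    MetricOutcome("Correctness", 0.13, 0.5),
    MetricOutcome("Faithfulness", 1.0, 0.7),
    MetricOutcome("Contextual Relevancy", 1.0, 1.0),
]
print([(o.name, o.passed) for o in outcomes])
print(all_metrics_passed(outcomes))
```

One design consequence of an "all must pass" rule is that a strict threshold on any single metric can dominate the overall pass rate, so thresholds are worth tuning per metric.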

## Benefits of this Approach

1. Comprehensive Evaluation: Covers multiple aspects of RAG system performance.
2. Flexibility: Easy to add or modify evaluation metrics and test cases.
3. Scalability: Capable of handling multiple test cases and metrics efficiently.
4. Interpretability: Provides detailed reasons for evaluation results, aiding in system improvement.

## Conclusion

Evaluating correctness, faithfulness, and contextual relevancy with `deepeval` gives a multi-faceted view of a RAG system: whether answers match the ground truth, whether they stay grounded in the retrieved context, and whether retrieval returns relevant material. Separating these signals makes it easier to tell generation problems from retrieval problems before the system is used in real-world applications.

In [2]:
!pip install deepeval

In [3]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

### Test Correctness

In [4]:
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o",
    # Only the expected and actual outputs are shown to the judge model.
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

gt_answer = "Madrid is the capital of Spain."
pred_answer = "MadriD."

test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output=gt_answer,
    actual_output=pred_answer,
)

correctness_metric.measure(test_case_correctness)
print(correctness_metric.score)

0.14338831977995006


### Test Faithfulness

In [5]:
question = "what is 3+3?"
context = ["6"]
generated_answer = "6"

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=False
)

test_case = LLMTestCase(
    input=question,
    actual_output=generated_answer,
    retrieval_context=context,
)

faithfulness_metric.measure(test_case)
print(faithfulness_metric.score)
# include_reason=False above, so no explanation is generated and this prints None.
print(faithfulness_metric.reason)



1
None


### Test Contextual Relevancy

In [6]:
actual_output = "then go somewhere else."
retrieval_context = ["this is a test context","mike is a cat","if the shoes don't fit, then go somewhere else."]
gt_answer = "if the shoes don't fit, then go somewhere else."

relevance_metric = ContextualRelevancyMetric(
    threshold=1,
    model="gpt-4",
    include_reason=True
)
relevance_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
    expected_output=gt_answer,
)

relevance_metric.measure(relevance_test_case)
print(relevance_metric.score)
print(relevance_metric.reason)

0.3333333333333333
The score is 0.33 because the provided contexts, 'this is a test context' and 'mike is a cat', do not provide any relevant information about what to do if the shoes don't fit.
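
The 0.33 score is consistent with one of the three retrieved passages being judged relevant. The following is a minimal sketch of that arithmetic, assuming the score is simply the number of relevant items divided by the total. The relevance flags are hand-labeled here for illustration, whereas the metric derives its verdicts from the evaluation model.

```python
retrieval_context = [
    "this is a test context",
    "mike is a cat",
    "if the shoes don't fit, then go somewhere else.",
]

# Hand-assigned relevance verdicts, one per passage (assumption for illustration).
is_relevant = [False, False, True]

relevancy_score = sum(is_relevant) / len(retrieval_context)
print(round(relevancy_score, 2))
```

If a metric passes only when its score reaches the threshold, then `threshold=1` means any irrelevant passage in the retrieval context is enough to fail the test case.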


In [7]:
new_test_case = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD.",
    retrieval_context=["Madrid is the capital of Spain."]
)

### Evaluate two test cases against several metrics at once

In [8]:
evaluate(
    test_cases=[relevance_test_case, new_test_case],
    metrics=[correctness_metric, faithfulness_metric, relevance_metric]
)

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 2 test case(s) in parallel: |██████████|100% (2/2) [Time Taken: 00:06,  3.01s/test case]



Metrics Summary

  - ❌ Correctness (GEval) (score: 0.1274114511883731, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output 'MadriD.' is not factually correct and does not match the expected output 'Madrid is the capital of Spain.', error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4, reason: None, error: None)
  - ✅ Contextual Relevancy (score: 1.0, threshold: 1.0, strict: False, evaluation model: gpt-4, reason: The score is 1.00 because the retrieval context perfectly matches the input query, with no discrepancies., error: None)

For test case:

  - input: What is the capital of Spain?
  - actual output: MadriD.
  - expected output: Madrid is the capital of Spain.
  - context: None
  - retrieval context: ['Madrid is the capital of Spain.']


Metrics Summary

  - ❌ Correctness (GEval) (score: 0.4579656221239533, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output captures par




[TestResult(success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.1274114511883731, reason="The actual output 'MadriD.' is not factually correct and does not match the expected output 'Madrid is the capital of Spain.'", strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.0016749999999999998, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n]'), MetricData(name='Faithfulness', threshold=0.7, success=True, score=1.0, reason=None, strict_mode=False, evaluation_model='gpt-4', error=None, evaluation_cost=0.012299999999999998, verbose_logs='Truths:\n[\n    "Madrid is the capital of Spain."\n] \n \nClaims:\n[] \n \nVerdicts:\n[]'), MetricData(name='Contextual Relevancy', threshold=1.0, success=True, score=1.0, reason='The score is 1.00 because the retrieval context perfectly matches the input query, with no discrepanc

### Function to create multiple LLMTestCases from four parallel lists:
* Questions
* Ground Truth Answers
* Generated Answers
* Retrieved Documents (each element is itself a list of strings)

In [9]:
def create_deep_eval_test_cases(questions, gt_answers, generated_answers, retrieved_documents):
    """Build one LLMTestCase per position across the four parallel lists.

    Note: zip() stops at the shortest list, so the inputs should all
    have the same length.
    """
    return [
        LLMTestCase(
            input=question,
            expected_output=gt_answer,
            actual_output=generated_answer,
            retrieval_context=retrieved_document,
        )
        for question, gt_answer, generated_answer, retrieved_document in zip(
            questions, gt_answers, generated_answers, retrieved_documents
        )
    ]
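
Because the helper relies on `zip`, which stops at the shortest input, a missing entry in any list silently drops test cases. A minimal sketch of a guard that could run before building the test cases is shown below. `check_equal_lengths` is a hypothetical helper that is not part of `deepeval`, and the sample lists are illustrative.

```python
def check_equal_lengths(**named_lists):
    """Raise ValueError if the given lists differ in length.

    Hypothetical helper: call it before building test cases so that a
    missing entry is reported instead of silently dropped.
    """
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Input lists differ in length: {lengths}")


questions = ["What is the capital of Spain?", "what is 3+3?"]
gt_answers = ["Madrid is the capital of Spain.", "6"]
generated_answers = ["MadriD."]  # one generated answer is missing

# zip() stops at the shortest input, so only one tuple is produced here.
print(len(list(zip(questions, gt_answers, generated_answers))))

try:
    check_equal_lengths(
        questions=questions,
        gt_answers=gt_answers,
        generated_answers=generated_answers,
    )
except ValueError as err:
    print(err)
```

Keyword arguments are used so that the error message names each list alongside its length, which makes the mismatch easy to locate.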