<div align="center">
<h2>Evaluating RAG</h2>
</div>

<div align="center">
    <h3 ><a href="https://aiengineering.academy/" target="_blank">AI Engineering.academy</a></h3>
    
    
</div>

<div align="center">
<a href="https://aiengineering.academy/" target="_blank">
<img src="https://raw.githubusercontent.com/adithya-s-k/AI-Engineering.academy/main/assets/banner.png" alt="Ai Engineering. Academy" width="50%">
</a>
</div>


<div align="center">

[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/pulls)
[![License](https://img.shields.io/github/license/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/blob/main/LICENSE)

</div>

## Introduction

Evaluation is a critical component in the development and optimization of Retrieval-Augmented Generation (RAG) systems. It involves assessing the performance, accuracy, and quality of various aspects of the RAG pipeline, from retrieval effectiveness to the relevance and faithfulness of generated responses.

## Importance of Evaluation in RAG

Effective evaluation of RAG systems is essential because it:
1. Helps identify strengths and weaknesses in the retrieval and generation processes.
2. Guides improvements and optimizations across the RAG pipeline.
3. Ensures the system meets quality standards and user expectations.
4. Facilitates comparison between different RAG implementations or configurations.
5. Helps detect issues such as hallucinations, biases, or irrelevant responses.


## Key Evaluation Metrics

### RAGAS Metrics
1. **Faithfulness**: Measures how well the generated response aligns with the retrieved context.
2. **Answer Relevancy**: Assesses the relevance of the response to the query.
3. **Context Recall**: Evaluates how well the retrieved chunks cover the information needed to answer the query.
4. **Context Precision**: Measures the proportion of relevant information in the retrieved chunks.
5. **Context Utilization**: Assesses how effectively the generated response uses the provided context.
6. **Context Entity Recall**: Evaluates the coverage of important entities from the context in the response.
7. **Noise Sensitivity**: Measures the system's robustness to irrelevant or noisy information.
8. **Summarization Score**: Assesses the quality of summarization in the response.
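
To make the two retrieval metrics above concrete, here is a deliberately simplified sketch. It assumes relevance labels are already known, whereas Ragas derives them with an LLM judge and its context precision is rank-aware. The helper names `toy_context_precision` and `toy_context_recall` are hypothetical and are not part of the Ragas API.

```python
def toy_context_precision(chunk_is_relevant: list[bool]) -> float:
    """Fraction of retrieved chunks that are relevant to the query."""
    if not chunk_is_relevant:
        return 0.0
    return sum(chunk_is_relevant) / len(chunk_is_relevant)


def toy_context_recall(needed_facts: set[str], retrieved_facts: set[str]) -> float:
    """Fraction of the facts needed for the answer that the retrieved chunks cover."""
    if not needed_facts:
        return 0.0
    return len(needed_facts & retrieved_facts) / len(needed_facts)


# Hypothetical labels: five retrieved chunks, three of them relevant
precision = toy_context_precision([True, True, False, True, False])
# Hypothetical facts: two are needed, the retrieved chunks cover one
recall = toy_context_recall({"N=6 layers", "multi-head attention"}, {"N=6 layers"})
print(precision, recall)
```

A retriever can score well on one of these and poorly on the other, which is why the two are usually reported together.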

### DeepEval Metrics
1. **G-Eval**: A general evaluation metric for text generation tasks.
2. **Summarization**: Assesses the quality of text summarization.
3. **Answer Relevancy**: Measures how well the response answers the query.
4. **Faithfulness**: Evaluates the accuracy of the response with respect to the source information.
5. **Contextual Recall and Precision**: Measures the effectiveness of context retrieval.
6. **Hallucination**: Detects fabricated or inaccurate information in the response.
7. **Toxicity**: Identifies harmful or offensive content in the response.
8. **Bias**: Detects unfair prejudice or favoritism in the generated content.

### TruLens Metrics
1. **Context Relevance**: Assesses how well the retrieved context matches the query.
2. **Groundedness**: Measures how well the response is supported by the retrieved information.
3. **Answer Relevance**: Evaluates how well the response addresses the query.
4. **Comprehensiveness**: Assesses the completeness of the response.
5. **Harmful/Toxic Language**: Identifies potentially offensive or dangerous content.
6. **User Sentiment**: Analyzes the emotional tone of user interactions.
7. **Language Mismatch**: Detects inconsistencies in language use between query and response.
8. **Fairness and Bias**: Evaluates the system for equitable treatment across different groups.
9. **Custom Feedback Functions**: Allows for tailored evaluation metrics specific to use cases.
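
As a rough illustration of what a custom feedback function might compute, the sketch below scores how many of the query's longer words reappear in the answer. It assumes only that a feedback function is a callable returning a score between 0 and 1. The name `query_term_coverage` and the four-letter cutoff are hypothetical choices, and wiring it into TruLens is left to the TruLens documentation.

```python
import re


def query_term_coverage(question: str, answer: str) -> float:
    """Share of distinct question words (4+ letters) that also appear in the answer."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z]{4,}", text.lower()))

    question_words = words(question)
    if not question_words:
        return 0.0
    return len(question_words & words(answer)) / len(question_words)


score = query_term_coverage(
    "How many encoders are stacked in the encoder?",
    "The encoder is a stack of six identical layers.",
)
print(score)
```

A lexical score like this is crude (it misses synonyms and inflections), so it is better suited as a cheap sanity check than as a replacement for LLM-judged relevance.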

## Best Practices for RAG Evaluation

1. **Comprehensive Evaluation**: Use a combination of metrics to assess different aspects of the RAG system.
2. **Regular Benchmarking**: Continuously evaluate the system as changes are made to the pipeline.
3. **Human-in-the-Loop**: Incorporate human evaluation alongside automated metrics for a holistic assessment.
4. **Domain-Specific Metrics**: Develop custom metrics relevant to your specific use case or domain.
5. **Error Analysis**: Investigate patterns in low-scoring responses to identify areas for improvement.
6. **Comparative Evaluation**: Benchmark your RAG system against baseline models and alternative implementations.
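
The error-analysis step above can be sketched as a simple tally: for every low-scoring response, record which metric was weakest. The function name, the 0.5 threshold, and the example scores are all hypothetical.

```python
from collections import Counter


def weakest_metric_counts(rows, threshold=0.5):
    """For each row whose lowest score is below `threshold`, count that lowest metric."""
    counts = Counter()
    for scores in rows:
        metric, value = min(scores.items(), key=lambda kv: kv[1])
        if value < threshold:
            counts[metric] += 1
    return counts


# Hypothetical per-question scores
rows = [
    {"faithfulness": 0.9, "context_recall": 0.3},
    {"faithfulness": 0.2, "context_recall": 0.8},
    {"faithfulness": 0.95, "context_recall": 0.4},
    {"faithfulness": 0.9, "context_recall": 0.9},
]
weak = weakest_metric_counts(rows)
print(weak.most_common())
```

If most failures cluster on a retrieval metric, the retriever or chunking is the likelier culprit; if they cluster on faithfulness, the prompt or the LLM deserves a closer look.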

## Conclusion

A robust evaluation framework is crucial for developing and maintaining high-quality RAG systems. By leveraging a diverse set of metrics and following best practices, developers can ensure their RAG systems deliver accurate, relevant, and trustworthy responses while continuously improving performance.

## Setting up the Environment

In [None]:
!pip install llama-index
!pip install llama-index-vector-stores-qdrant 
!pip install llama-index-readers-file 
!pip install llama-index-embeddings-fastembed 
!pip install llama-index-llms-openai
!pip install llama-index-llms-groq
!pip install -U qdrant_client fastembed
!pip install python-dotenv

In [1]:
# Standard library imports
import logging
import sys
import os

# Third-party imports
from dotenv import load_dotenv
from IPython.display import Markdown, display

# Qdrant client import
import qdrant_client

# LlamaIndex core imports
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

# LlamaIndex vector store import
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Embedding model imports
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding

# LLM import
from llama_index.llms.openai import OpenAI
from llama_index.llms.groq import Groq
# Load environment variables
load_dotenv()

# Read API keys from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

# Set up the base LLM
Settings.llm = OpenAI(
    model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True
)

# Alternative: use a Groq-hosted model instead
# Settings.llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)

# Set the embedding model
# Option 1: FastEmbed with the BAAI/bge-base-en-v1.5 model (commented out)
# Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Option 2: OpenAI's embedding model (active in this notebook)
Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)

# Qdrant Cloud configuration (commented out)
# If you are using Qdrant Cloud, uncomment and set these variables:
# QDRANT_CLOUD_ENDPOINT = os.getenv("QDRANT_CLOUD_ENDPOINT")
# QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

# Note: add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if you use the hosted version of Qdrant
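
Because `os.getenv` silently returns `None` for a missing variable, it can help to check the keys before building any clients. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the example mapping stands in for your real environment.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Explicit example mapping so the result does not depend on your machine
example_env = {"OPENAI_API_KEY": "dummy-value", "GROQ_API_KEY": ""}
missing = missing_env_vars(
    ["OPENAI_API_KEY", "GROQ_API_KEY", "QDRANT_API_KEY"], example_env
)
print(missing)
```

In the notebook you would call `missing_env_vars([...])` without the second argument, right after `load_dotenv()`.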



## Load the Data

In [2]:
# Load the documents using SimpleDirectoryReader

print("🔃 Loading Data")

from llama_index.core import Document
reader = SimpleDirectoryReader("../data/" , recursive=True)
documents = reader.load_data(show_progress=True)

🔃 Loading Data


Loading files: 100%|██████████| 1/1 [00:00<00:00,  4.27file/s]


## Setting up Vector Database

We will use Qdrant as the vector database.
There are four ways to initialize a Qdrant client:

1. In-memory
```python
client = qdrant_client.QdrantClient(location=":memory:")
```
2. Disk
```python
client = qdrant_client.QdrantClient(path="./data")
```
3. Self hosted or Docker
```python

client = qdrant_client.QdrantClient(
    # url="http://<host>:<port>"
    host="localhost",port=6333
)
```

4. Qdrant cloud
```python
client = qdrant_client.QdrantClient(
    url=QDRANT_CLOUD_ENDPOINT,
    api_key=QDRANT_API_KEY,
)
```

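For this notebook we will connect to a local Qdrant instance (`host="localhost"`, `port=6333`). To use Qdrant Cloud instead, pass `url` and `api_key` as in option 4.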
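
One way to switch between the four modes listed above without editing the client call is to build the keyword arguments from a mode name. This is only a sketch: `qdrant_client_kwargs` and its mode names are hypothetical, while the keyword names (`location`, `path`, `host`, `port`, `url`, `api_key`) are the ones used in the snippets above.

```python
def qdrant_client_kwargs(mode, *, path="./data", host="localhost", port=6333,
                         url=None, api_key=None):
    """Return keyword arguments for QdrantClient for one of the four modes."""
    if mode == "memory":
        return {"location": ":memory:"}
    if mode == "disk":
        return {"path": path}
    if mode == "server":
        return {"host": host, "port": port}
    if mode == "cloud":
        if not url or not api_key:
            raise ValueError("cloud mode needs both url and api_key")
        return {"url": url, "api_key": api_key}
    raise ValueError(f"unknown mode: {mode!r}")


kwargs = qdrant_client_kwargs("server")
print(kwargs)
# With qdrant_client installed: client = qdrant_client.QdrantClient(**kwargs)
```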

In [3]:
# creating a qdrant client instance

client = qdrant_client.QdrantClient(
    # you can use :memory: mode for fast and light-weight experiments,
    # it does not require to have Qdrant deployed anywhere
    # but requires qdrant-client >= 1.1.1
    # location=":memory:"
    # otherwise set Qdrant instance address with:
    # url=QDRANT_CLOUD_ENDPOINT,
    # otherwise set Qdrant instance with host and port:
    host="localhost",
    port=6333
    # set API KEY for Qdrant Cloud
    # api_key=QDRANT_API_KEY,
    # path="./db/"
)

vector_store = QdrantVectorStore(client=client, collection_name="01_Basic_RAG")

### Ingest Data into vector DB

In [5]:
# Ingest data into the vector database

# Set up an ingestion pipeline

from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        # MarkdownNodeParser(include_metadata=True),
        # TokenTextSplitter(chunk_size=500, chunk_overlap=20),
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        # SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),
        Settings.embed_model,
    ],
    vector_store=vector_store,
)

# Ingest directly into a vector db
nodes = pipeline.run(documents=documents , show_progress=True)
print("Number of chunks added to vector DB :",len(nodes))

Parsing nodes: 100%|██████████| 58/58 [00:00<00:00, 555.31it/s]
Generating embeddings: 100%|██████████| 58/58 [00:08<00:00,  7.04it/s]


Number of chunks added to vector DB : 58
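
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter over whitespace-separated words. It is an illustration only: `SentenceSplitter` counts tokens with a tokenizer and tries to respect sentence boundaries, which this hypothetical `split_with_overlap` helper does not.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "the encoder is composed of a stack of six identical layers".split()
chunks = split_with_overlap(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk.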


## Setting Up Index

In [6]:
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

## Modifying Prompts and Prompt Tuning

In [7]:
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

refine_prompt_str = (
    "We have the opportunity to refine the original answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question: {query_str}. "
    "If the context isn't useful, output the original answer again.\n"
    "Original Answer: {existing_answer}"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ("system", "You are an AI assistant who is well versed in answering questions from the provided context"),
    ("user", qa_prompt_str),
]
text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)

# Refine Prompt
chat_refine_msgs = [
    ("system","Always answer the question, even if the context isn't helpful.",),
    ("user", refine_prompt_str),
]
refine_template = ChatPromptTemplate.from_messages(chat_refine_msgs)
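
To see what the model receives, the QA template can be filled in by hand with plain string formatting. This sketch bypasses `ChatPromptTemplate` entirely, and the two chunks are hypothetical stand-ins for retrieved nodes.

```python
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Hypothetical retrieved chunks; in the notebook these come from the vector store
retrieved_chunks = [
    "The encoder is composed of a stack of N = 6 identical layers.",
    "Each layer has two sub-layers.",
]

filled_prompt = qa_prompt_str.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="How many encoders are stacked in the encoder?",
)
print(filled_prompt)
```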

### Example of Retrievers

- Query Engine
- Chat Engine

In [11]:
# Setting up Query Engine
BASE_RAG_QUERY_ENGINE = index.as_query_engine(
        similarity_top_k=5,
        text_qa_template=text_qa_template,
        refine_template=refine_template,)


response = BASE_RAG_QUERY_ENGINE.query("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))

According to the context information, the encoder is composed of a stack of N = 6 identical layers.

In [12]:
# Setting up Chat Engine
BASE_RAG_CHAT_ENGINE = index.as_chat_engine()

response = BASE_RAG_CHAT_ENGINE.chat("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))

The number of encoders stacked in the encoder is 6.
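
The `similarity_top_k=5` setting asks the vector store for the five chunks closest to the query embedding. A brute-force version of that idea, using cosine similarity, looks roughly like this. It is a conceptual sketch with hypothetical two-dimensional vectors; Qdrant itself uses an approximate nearest-neighbour index and the distance metric configured on the collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Hypothetical 2-dimensional embeddings; real ones have hundreds of dimensions or more
chunk_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
best = top_k([1.0, 0.1], chunk_vecs, k=2)
print(best)
```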

### Setup Observability

In [None]:
!pip install "arize-phoenix[llama-index]"

In [None]:
import phoenix as px

(session := px.launch_app()).view()

In [None]:
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanLimits, TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))

LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)

## Generating Test Dataset

Curating a golden test dataset for evaluation can be a long, tedious, and expensive process, and it is often impractical when you are starting out or when data sources keep changing. A practical alternative is to synthetically generate high-quality data points that developers can then verify. This can reduce the time and effort spent curating test data by as much as 90%.

In [None]:
!pip install ragas 

In [None]:
from phoenix.trace import using_project
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

TEST_SIZE = 5

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset
with using_project("ragas-testset"):
    testset = generator.generate_with_llamaindex_docs(
        documents, test_size=TEST_SIZE, distributions=distribution
    )
test_df = (
    testset.to_pandas().sort_values("question").drop_duplicates(subset=["question"], keep="first")
)
test_df.head(2)

You are free to change the question type distribution according to your needs. Now that the test dataset is ready, let's run its questions through the RAG query engine built above and evaluate the results.
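
With a small `TEST_SIZE`, the distribution weights rarely divide evenly, so it is worth seeing how many questions of each type to expect. The sketch below uses largest-remainder rounding. This is an assumption made for illustration; Ragas may allocate questions differently, and `questions_per_type` is a hypothetical helper.

```python
def questions_per_type(distribution, test_size):
    """Allocate `test_size` questions across types using largest-remainder rounding."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution weights must sum to 1")
    exact = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = test_size - sum(counts.values())
    # Hand the remaining questions to the types with the largest fractional parts
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts = questions_per_type(
    {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}, test_size=5
)
print(counts)
```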

### RAGAS

In [None]:
import pandas as pd
from datasets import Dataset
from phoenix.trace import using_project
from tqdm.auto import tqdm


def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }


def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds


with using_project("llama-index"):
    ragas_eval_dataset = generate_ragas_dataset(BASE_RAG_QUERY_ENGINE, test_df)

ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
ragas_evals_df.head(2)

In [None]:
from phoenix.trace.dsl import SpanQuery

# Dataset containing embeddings for visualization
query_embeddings_df = px.Client().query_spans(
    SpanQuery().explode("embedding.embeddings", text="embedding.text", vector="embedding.vector"),
    project_name="llama-index",
)
query_embeddings_df.head(2)

In [None]:
from phoenix.session.evaluation import get_qa_with_reference

# Dataset containing span data for evaluation with Ragas
# (use the Phoenix client here, not the Qdrant `client` defined earlier)
spans_dataframe = get_qa_with_reference(px.Client(), project_name="llama-index")
spans_dataframe.head(2)

Ragas uses LangChain to evaluate your LLM application data. Since we initialized the LangChain instrumentation above we can see what's going on under the hood when we evaluate our LLM application.

In [None]:
from phoenix.trace import using_project
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    context_precision,
    context_recall,
    faithfulness,
)

# Log the traces to the project "ragas-evals" just to view
# how Ragas works under the hood
with using_project("ragas-evals"):
    evaluation_result = evaluate(
        dataset=ragas_eval_dataset,
        metrics=[faithfulness, answer_correctness, context_recall, context_precision],
    )
eval_scores_df = pd.DataFrame(evaluation_result.scores)

In [None]:
# Assign span ids to your ragas evaluation scores (needed so Phoenix knows where to attach the spans).
span_questions = (
    spans_dataframe[["input"]]
    .sort_values("input")
    .drop_duplicates(subset=["input"], keep="first")
    .reset_index()
    .rename({"input": "question"}, axis=1)
)
ragas_evals_df = ragas_evals_df.merge(span_questions, on="question").set_index("context.span_id")
test_df = test_df.merge(span_questions, on="question").set_index("context.span_id")
eval_data_df = pd.DataFrame(evaluation_result.dataset)
eval_data_df = eval_data_df.merge(span_questions, on="question").set_index("context.span_id")
eval_scores_df.index = eval_data_df.index

query_embeddings_df = (
    query_embeddings_df.sort_values("text")
    .drop_duplicates(subset=["text"])
    .rename({"text": "question"}, axis=1)
    .merge(span_questions, on="question")
    .set_index("context.span_id")
)
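
The merges above amount to an inner join on the question text, keeping one span per question. Stripped of pandas, the idea can be sketched as follows. `attach_span_ids` and the example rows are hypothetical; the column names `input`, `question`, and `context.span_id` are the ones used in the cell above.

```python
def attach_span_ids(score_rows, span_rows):
    """Key each score row by the span id of a span whose input equals its question."""
    span_id_by_question = {}
    for span in span_rows:
        # Keep one span per question (here: the first one seen)
        span_id_by_question.setdefault(span["input"], span["context.span_id"])
    return {
        span_id_by_question[row["question"]]: row
        for row in score_rows
        if row["question"] in span_id_by_question  # inner join: unmatched rows drop out
    }


span_rows = [
    {"context.span_id": "s1", "input": "What is attention?"},
    {"context.span_id": "s2", "input": "What is attention?"},
    {"context.span_id": "s3", "input": "How many layers?"},
]
score_rows = [
    {"question": "How many layers?", "faithfulness": 1.0},
    {"question": "Unseen question", "faithfulness": 0.5},
    {"question": "What is attention?", "faithfulness": 0.8},
]
indexed = attach_span_ids(score_rows, span_rows)
print(sorted(indexed))
```

Note that any question without a matching span silently disappears from the result, so it is worth comparing row counts before and after the join.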

## DeepEval

In [None]:
from deepeval.integrations.llama_index import (
    DeepEvalAnswerRelevancyEvaluator,
    DeepEvalFaithfulnessEvaluator,
    DeepEvalContextualRelevancyEvaluator,
    DeepEvalSummarizationEvaluator,
    DeepEvalBiasEvaluator,
    DeepEvalToxicityEvaluator,
)

# An example input to your RAG application
user_input = "What is Attention"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = BASE_RAG_QUERY_ENGINE.query(user_input)

# Create a list of all evaluators
evaluators = [
    DeepEvalAnswerRelevancyEvaluator(),
    DeepEvalFaithfulnessEvaluator(),
    DeepEvalContextualRelevancyEvaluator(),
    DeepEvalSummarizationEvaluator(),
    DeepEvalBiasEvaluator(),
    DeepEvalToxicityEvaluator(),
]

# Evaluate the response using all evaluators
for evaluator in evaluators:
    evaluation_result = evaluator.evaluate_response(
        query=user_input, response=response_object
    )
    print(f"{evaluator.__class__.__name__} Result:")
    print(evaluation_result)
    print("\n" + "="*50 + "\n") 

## TruLens

In [None]:
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai

In [None]:
from trulens.core import TruSession

session = TruSession()
session.reset_database()

In [None]:
import numpy as np
from trulens.apps.llamaindex import TruLlama
from trulens.core import Feedback
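# Aliased so it does not shadow the LlamaIndex `OpenAI` LLM class imported earlier
from trulens.providers.openai import OpenAI as TruLensOpenAI

# Initialize provider class
provider = TruLensOpenAI()

# Select the context to be used in feedback; its location is app-specific.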

context = TruLlama.select_context(BASE_RAG_QUERY_ENGINE)

# Define a groundedness feedback function
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
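
The context relevance feedback above is scored once per retrieved chunk and then reduced to a single number by `.aggregate(np.mean)`. The reduction step on its own can be sketched like this, with hypothetical per-chunk scores and a hypothetical helper name:

```python
from statistics import mean


def aggregate_context_relevance(chunk_scores, aggregate=mean):
    """Collapse per-chunk relevance scores (each in [0, 1]) into one score per record."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    return aggregate(chunk_scores)


# Hypothetical scores for one query with five retrieved chunks
chunk_scores = [1.0, 0.75, 0.5, 0.25, 0.0]
mean_score = aggregate_context_relevance(chunk_scores)
strictest_score = aggregate_context_relevance(chunk_scores, min)
print(mean_score, strictest_score)
```

The mean rewards retrieval that is relevant on average, while a stricter aggregate such as `min` penalizes any irrelevant chunk. Which one fits depends on how sensitive your generator is to noisy context.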

In [None]:
tru_query_engine_recorder = TruLlama(
    BASE_RAG_QUERY_ENGINE,
    app_name="LlamaIndex_App",
    app_version="base",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

In [None]:
# Record an app invocation by using the recorder as a context manager
with tru_query_engine_recorder as recording:
    BASE_RAG_QUERY_ENGINE.query("What is Attention")

In [None]:
# The record of the app invocation can be retrieved from the `recording`:

rec = recording.get()  # use .get if only one record
# recs = recording.records # use .records if multiple

display(rec)

In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(session)

In [None]:
# The results of the feedback functions can be retrieved from
# `Record.feedback_results` or with the `wait_for_feedback_results` method.
# Results retrieved directly are `Future` instances (see
# `concurrent.futures`). You can use `as_completed` to wait until they have
# finished evaluating, or use the utility method:

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

# See more about wait_for_feedback_results:
# help(rec.wait_for_feedback_results)

In [None]:
records, feedback = session.get_records_and_feedback()

records.head()

In [None]:
session.get_leaderboard()

In [None]:
run_dashboard(session)  # open a local streamlit app to explore

# stop_dashboard(session) # stop if needed