**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Build a Simple RAG System

## Install OpenAI and LangChain dependencies

In [0]:
!pip install langchain==0.3.10
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install dill

## Install Chroma Vector DB and LangChain wrapper

In [0]:
!pip install langchain-chroma==0.1.4

## Install RAG Evaluation Libraries

In [0]:
!pip install ragas==0.2.8
!pip install deepeval==1.4.7

## Enter OpenAI API Key

In [0]:
from getpass import getpass

OPENAI_KEY = getpass('Enter Open AI API Key: ')

## Setup Environment Variables

In [0]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the newer `text-embedding-3-small` (smaller and more efficient) and `text-embedding-3-large` (larger and more powerful). We use `text-embedding-3-small` here.

In [0]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Loading and Processing the Data

### Get the dataset

In [0]:
# if the gdown command below fails, download the file manually from
# https://drive.google.com/file/d/1QkSY9W5RyaBnY8c5FLIsmpPVXoHTQ-fb/view?usp=sharing
# and upload it to Colab
!gdown 1QkSY9W5RyaBnY8c5FLIsmpPVXoHTQ-fb

### Load and Process CSV Documents

In [0]:
import pandas as pd

df = pd.read_csv('./rag_eval_docs.csv')
df

In [0]:
docs = df.to_dict(orient='records')
docs[:3]

In [0]:
from langchain_core.documents import Document

# wrap each CSV row in a LangChain Document, keeping title and id as metadata
processed_docs = []

for doc in docs:
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
    }
    data = doc['context']
    processed_docs.append(Document(page_content=data, metadata=metadata))
processed_docs[:3]

## Index Document Chunks and Embeddings in Vector DB

Here we build a Chroma vector store from the documents and their embeddings. Passing `persist_directory` tells Chroma to save the index to disk in that directory, so it can be reloaded later without re-embedding the documents.

In [0]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=processed_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")

### Load Vector DB from disk

This shows that once a vector database has been saved to disk, you can reconnect to it at any time by pointing Chroma at the same directory and collection name.

In [0]:
# load from disk
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

In [0]:
chroma_db

### Semantic Similarity based Retrieval

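We use cosine similarity here and retrieve up to 3 documents for the user's query, keeping only those whose relevance score is at least 0.3. A query with no sufficiently similar documents can therefore return fewer than 3 results, or none.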

In [0]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity_score_threshold",
                                              search_kwargs={"k": 3, "score_threshold": 0.3})
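
To make the threshold behaviour concrete, here is a minimal pure-Python sketch of the idea: score every document against the query, drop anything below the threshold, then keep the top `k`. The helper names (`cosine_similarity`, `top_k_above_threshold`) and the 2-dimensional toy vectors are illustrative assumptions, not part of LangChain or Chroma, and the sketch does not reproduce how the retriever computes its relevance scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_above_threshold(query_vec, doc_vecs, k=3, score_threshold=0.3):
    """Return (index, score) pairs for the best k documents scoring >= threshold."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(doc_vecs)]
    kept = [(idx, score) for idx, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical toy embeddings; real embeddings have hundreds or thousands of dimensions
toy_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]

toy_hits = top_k_above_threshold(toy_query, toy_docs)
print(toy_hits)
```

Only two of the four toy documents clear the 0.3 threshold, so fewer than `k` results come back even though `k` is 3.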

In [0]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content))
        print()

In [0]:
query = "what is AI?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "how do plants survive?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

## Build the RAG Pipeline

In [0]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer to the point based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [0]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI


chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

src_rag_response_chain = (
    {
        "context": (itemgetter('context')
                        |
                    RunnableLambda(format_docs)),
        "question": itemgetter("question")
    }
        |
    rag_prompt_template
        |
    chatgpt
        |
    StrOutputParser()
)

rag_chain_w_sources = (
    {
        "context": similarity_retriever,
        "question": RunnablePassthrough()
    }
        |
    RunnablePassthrough.assign(response=src_rag_response_chain)
)
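
The data flow through the two chains can be easier to follow in plain Python. The sketch below mimics it with stand-ins: `ToyDoc`, `toy_retriever`, `toy_llm`, and `toy_rag_chain` are hypothetical names invented for illustration, and no LangChain or OpenAI call is made. It is meant to show why the final result is a dictionary with `context`, `question`, and `response` keys, not how LangChain executes runnables.

```python
from dataclasses import dataclass

@dataclass
class ToyDoc:
    """Stand-in for a LangChain Document with only the field used here."""
    page_content: str

def toy_retriever(question):
    # Stand-in for similarity_retriever: returns a fixed list of documents
    return [ToyDoc("Plants make food through photosynthesis."),
            ToyDoc("Roots absorb water and minerals.")]

def toy_format_docs(docs):
    # Same joining logic as format_docs in the notebook
    return "\n\n".join(doc.page_content for doc in docs)

def toy_llm(prompt):
    # Stand-in for the chat model: reports the prompt size instead of answering
    return f"(answer generated from a {len(prompt)}-character prompt)"

def toy_rag_chain(question):
    # Step 1: the outer dict runs the retriever and passes the question through
    state = {"context": toy_retriever(question), "question": question}
    # Step 2: the inner chain formats the context, fills the prompt, calls the model
    prompt = (f"Question:\n{state['question']}\n\n"
              f"Context:\n{toy_format_docs(state['context'])}\n\nAnswer:")
    # Step 3: assign() adds the answer under a new key and keeps the earlier keys
    state["response"] = toy_llm(prompt)
    return state

toy_result = toy_rag_chain("How do plants survive?")
print(sorted(toy_result))
```

Keeping the retrieved documents alongside the answer matters later in the notebook, where the evaluation step reads both `response` and `context` from the result.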

In [0]:
query = "What is AI?"
result = rag_chain_w_sources.invoke(query)
result

In [0]:
query = "How do plants survive?"
result = rag_chain_w_sources.invoke(query)
result

# Create End-to-End RAG Evaluation Workflow

![](https://i.imgur.com/GUIkpjy.png)

## Create a Synthetic RAG Golden Reference Dataset

In [0]:
doc_contexts = [doc.page_content for doc in processed_docs]
doc_contexts[:3]

In [0]:
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer import types

In [0]:
synthesizer = Synthesizer(model='gpt-4o',
                          embedder=OpenAIEmbeddings())

eval_data = synthesizer.generate_goldens(
    # Each context is a list of text chunks; here every document becomes its own single-chunk context
    contexts=[[doc] for doc in doc_contexts],
    include_expected_output=True,
    max_goldens_per_context=1,
    num_evolutions=1,
    scenario="Retrieval Augmented Generation",
    task="Question Answering",
    evolutions={
        types.Evolution.REASONING: 0.1,     # Evolves the input to require multi-step logical thinking.
        types.Evolution.MULTICONTEXT: 0.9,  # Ensures that all relevant information from the context is utilized.
        types.Evolution.CONCRETIZING: 0.0,  # Makes abstract ideas more concrete and detailed.
        types.Evolution.CONSTRAINED: 0.0,   # Introduces a condition or restriction, testing the model's ability to operate within specific limits.
        types.Evolution.COMPARATIVE: 0.0,   # Requires a response that involves a comparison between options or contexts.
        types.Evolution.HYPOTHETICAL: 0.0,  # Forces the model to consider and respond to a hypothetical scenario.
        types.Evolution.IN_BREADTH: 0.0,    # Broadens the input to touch on related or adjacent topics.
    }
)
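
The `evolutions` dictionary can be read as a sampling distribution over evolution types. The sketch below illustrates that reading with the standard library only, assuming the weights act as probabilities; the names `toy_evolution_weights` and `sample_evolutions` are hypothetical, and this is not deepeval's internal implementation.

```python
import random

# Same weights as in the cell above, keyed by plain strings for illustration
toy_evolution_weights = {
    "REASONING": 0.1,
    "MULTICONTEXT": 0.9,
    "CONCRETIZING": 0.0,
    "CONSTRAINED": 0.0,
    "COMPARATIVE": 0.0,
    "HYPOTHETICAL": 0.0,
    "IN_BREADTH": 0.0,
}

def sample_evolutions(weights, n, seed=0):
    """Draw n evolution names, each with probability proportional to its weight."""
    rng = random.Random(seed)
    # Zero-weight evolutions are dropped so they can never be selected
    names = [name for name, weight in weights.items() if weight > 0]
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n)

toy_samples = sample_evolutions(toy_evolution_weights, n=1000)
print({name: toy_samples.count(name) for name in set(toy_samples)})
```

Under this reading, roughly nine in ten generated questions would be evolved with `MULTICONTEXT` and the rest with `REASONING`, while the zero-weight types are never used.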

In [0]:
eval_data[0]

## Save the Synthetic RAG Golden Reference Dataset

In [0]:
import dill

In [0]:
with open('golden_ref_data.bin', 'wb') as f:
    dill.dump(eval_data, f)

## Create RAG Evaluation Dataset

In [0]:
from deepeval.dataset import EvaluationDataset

eval_dataset = EvaluationDataset()

# load golden dataset
with open('golden_ref_data.bin', 'rb') as f:
    golden_docs = dill.load(f)

eval_dataset.goldens = golden_docs

In [0]:
eval_dataset.goldens[0]

In [0]:
eval_dataset.goldens[0].input

In [0]:
rag_chain_w_sources.invoke(eval_dataset.goldens[0].input)

In [0]:
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
from tqdm import tqdm

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in tqdm(goldens):
        response_obj = rag_chain_w_sources.invoke(golden.input)
        test_case = LLMTestCase(
            input=golden.input,
            actual_output=response_obj['response'],
            expected_output=golden.expected_output,
            context=golden.context,
            retrieval_context=[doc.page_content for doc in response_obj['context']]
        )
        test_cases.append(test_case)
    return test_cases

In [0]:
eval_dataset.test_cases = convert_goldens_to_test_cases(eval_dataset.goldens)

In [0]:
eval_dataset.test_cases[0]

## Run and View RAG Evaluations on the Evaluation Dataset

In [0]:
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric

contextual_precision = ContextualPrecisionMetric(threshold=0.5, include_reason=True, model="gpt-4o")
contextual_recall = ContextualRecallMetric(threshold=0.5, include_reason=True, model="gpt-4o")
contextual_relevancy = ContextualRelevancyMetric(threshold=0.5, include_reason=True, model="gpt-4o")
answer_relevancy = AnswerRelevancyMetric(threshold=0.5, include_reason=True, model="gpt-4o")
faithfulness = FaithfulnessMetric(threshold=0.5, include_reason=True, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, include_reason=True, model="gpt-4o")
ragas_answer_relevancy = RAGASAnswerRelevancyMetric(threshold=0.5, embeddings=OpenAIEmbeddings(), model="gpt-4o")

eval_results = evaluate(test_cases=eval_dataset.test_cases,
                        metrics=[contextual_precision, contextual_recall, contextual_relevancy,
                                 answer_relevancy, ragas_answer_relevancy, faithfulness, hallucination])
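
Each metric above is given `threshold=0.5`, and the direction of the comparison depends on the metric. The sketch below shows the pass/fail rule as I understand it: most of these metrics pass when the score is at or above the threshold, while the hallucination score is better when lower, so it passes at or below the threshold. The function `metric_passes` and the scores are made up for illustration and do not come from an actual evaluation run.

```python
def metric_passes(score, threshold=0.5, lower_is_better=False):
    """Return True if a score satisfies its threshold in the given direction."""
    if lower_is_better:
        return score <= threshold
    return score >= threshold

# Hypothetical scores, not real evaluation output
toy_scores = {"Faithfulness": 0.9, "Answer Relevancy": 0.4, "Hallucination": 0.2}
toy_lower_is_better = {"Hallucination"}

toy_outcomes = {
    name: metric_passes(score, 0.5, lower_is_better=name in toy_lower_is_better)
    for name, score in toy_scores.items()
}
print(toy_outcomes)
```

This is worth keeping in mind when reading the summary statistics: a low `Hallucination_Score` is a good sign, whereas a low score on the other metrics is not.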

In [0]:
eval_results.test_results[0]

In [0]:
eval_metrics = []
for result in eval_results.test_results:
    eval_dict = {}
    eval_dict['Input'] = result.input
    eval_dict['Expected Output'] = result.expected_output
    eval_dict['Actual Output'] = result.actual_output
    eval_dict['Context'] = result.context
    eval_dict['Retrieval Context'] = result.retrieval_context
    eval_dict['Success'] = result.success
    metrics = result.metrics_data
    for metric in metrics:
        eval_dict[metric.name+'_Score'] = metric.score
    for metric in metrics:
        eval_dict[metric.name+'_Success'] = metric.success
    for metric in metrics:
        eval_dict[metric.name+'_Reason'] = metric.reason
    eval_metrics.append(eval_dict)

In [0]:
eval_metrics[0]

In [0]:
import pandas as pd

eval_results_df = pd.DataFrame(eval_metrics)
eval_results_df.T

In [0]:
eval_results_df[['Contextual Precision_Score', 'Contextual Recall_Score', 'Contextual Relevancy_Score',
                 'Answer Relevancy_Score', 'Answer Relevancy (ragas)_Score',
                 'Faithfulness_Score', 'Hallucination_Score']].describe()