# **RAG Evaluation**
- A RAG workflow is evaluated in two parts:
1. Retrieval
2. Generation

# **Different RAG Evaluation Techniques:**
**Each framework listed here provides multiple metrics to evaluate the retrieval or generation part**

- **RAGAS:** https://docs.ragas.io/en/latest/concepts/metrics/context_recall.html
- **TruLens:** https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/
- **ARES** :https://github.com/stanford-futuredata/ARES?tab=readme-ov-file#section3

- **ROUGE** is an evaluation metric for **text summarization tasks**
- **BLEU** is an evaluation metric for **text translation tasks**
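To make the two n-gram metrics concrete, here is a minimal sketch of clipped unigram overlap. It is a simplification: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants. The function name `unigram_overlap` and the example sentences are hypothetical.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> dict:
    """Clipped unigram overlap between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    return {
        # BLEU-1 style: share of candidate tokens found in the reference.
        "precision": overlap / max(sum(cand.values()), 1),
        # ROUGE-1 recall style: share of reference tokens found in the candidate.
        "recall": overlap / max(sum(ref.values()), 1),
    }

scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Precision-oriented overlap suits translation, where extra words are penalized, while recall-oriented overlap suits summarization, where missing reference content is penalized.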

#### **This script was executed in a Colab notebook**

- In this **LangChain** approach we used the **Weaviate vector DB as an in-memory DB**
  -  **from weaviate.embedded import EmbeddedOptions**


## **Steps Followed for RAG [LangChain / LlamaIndex Framework]**:
- **Ingest data** from a .txt file hosted on a website
- Convert the data to **chunks**
- Create embeddings using the **OpenAI embedding model**
- Store the embeddings in the **Weaviate in-memory DB**
- Use an **LLM**, here **OpenAI GPT-3.5-Turbo**, and pass it the **user question + vector DB retrieval + system prompt**
- This gives the **end result**, to which we then apply **evaluation**

**EVALUATION**
- Here we tried the **RAGAS** technique. It provides multiple **metrics** to **evaluate the retrieval or generation part**. A few examples:

- **Retrieval part metrics**
  - context_recall
  - context_precision

- **Generation part metrics**
  - faithfulness
  - answer_relevancy
  
- **Other**
  - answer_similarity
  - answer_correctness

- Here we **provided sample questions with ground truth manually**; in a production scenario, another **LLM can produce the ground truth or act as a critic model** to evaluate the current model
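The retrieval and generation metrics above reduce to simple ratios once each statement or chunk has been judged. The sketch below assumes those judgments are already available as hand-set booleans; in RAGAS they come from an LLM judge, and the exact prompts and formulas may differ between versions. All function names here are hypothetical.

```python
def context_recall(attributed: list[bool]) -> float:
    """Share of ground-truth statements that the retrieved context supports."""
    return sum(attributed) / len(attributed) if attributed else 0.0

def context_precision(relevant: list[bool]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0

def faithfulness(supported: list[bool]) -> float:
    """Share of claims in the generated answer that the context supports."""
    return sum(supported) / len(supported) if supported else 0.0

# Hand-labelled example: chunks at ranks 1 and 3 are relevant, rank 2 is not.
print(context_precision([True, False, True]))
print(context_recall([True, True, False, True]))
print(faithfulness([True, True]))
```

Note that context precision rewards ranking relevant chunks first, so the same set of chunks in a different order gives a different score.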


### To evaluate the RAG pipeline, RAGAs expects the following information:

- **question:** The user query that is the input of the RAG pipeline. The input.
- **answer:** The generated answer from the RAG pipeline. The output.
- **contexts:** The contexts retrieved from the external knowledge source used to answer the question.
- **ground_truths:** The ground truth answer to the question. This is the only human-annotated information. This information is only required for the metric context_recall (see Evaluation Metrics).
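Before calling the evaluator it can help to check that the four columns line up. The following is a minimal sketch of such a check in plain Python; `validate_eval_data` is a hypothetical helper, not part of RAGAS, and the sample record is made up for illustration.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truths")

def validate_eval_data(data: dict) -> int:
    """Check column presence, equal lengths, and list-typed contexts; return the row count."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    # Each question maps to a list of retrieved context strings.
    if not all(isinstance(ctx, list) for ctx in data["contexts"]):
        raise TypeError("each entry of 'contexts' must be a list of strings")
    return lengths["question"]

sample = {
    "question": ["What is RAG?"],
    "answer": ["Retrieval-augmented generation."],
    "contexts": [["RAG combines retrieval with generation."]],
    "ground_truths": [["RAG is retrieval-augmented generation."]],
}
n_rows = validate_eval_data(sample)
print(n_rows)
```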

In [1]:
!pip install langchain openai weaviate-client ragas -q


## **Read external data**

In [2]:
import requests
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [3]:
url="https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)

In [4]:
res.text[:100]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th'

In [5]:
#Write the downloaded text to state_of_the_union.txt in the Colab workspace (it is saved inside the content folder automatically)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)

#Read this state_of_the_union.txt using Langchain document_loaders TextLoader
loader=TextLoader("./state_of_the_union.txt")
documents=loader.load()

## **Chunk the document**

In [6]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter=CharacterTextSplitter(chunk_size=500,chunk_overlap=50)

In [7]:
chunks=text_splitter.split_documents(documents)
chunks[2].page_content

'Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n\nLet each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.'

In [8]:
len(chunks)


90
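To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal sliding-window sketch. It is an assumption-laden simplification: `CharacterTextSplitter` splits on a separator and merges the pieces, so its chunk boundaries differ from this plain character window. The function `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Synthetic 1000-character text so the overlap is easy to verify.
sample_text = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(sample_text)
print(len(pieces), [len(p) for p in pieces])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.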

## **Embed the chunks and create the vector DB**

In [9]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions

In [10]:
import os
from google.colab import userdata

OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"]=OPENAI_API_KEY

In [11]:
#This client helps to save embeddings in weaviate in memory
weaviate_client=weaviate.Client(
    embedded_options=EmbeddedOptions()
)

Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.23.7/weaviate-v1.23.7-Linux-amd64.tar.gz
Started /root/.cache/weaviate-embedded: process ID 635


In [12]:
#Create Weaviate Vector DB
vectorstore=Weaviate.from_documents(
    client=weaviate_client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False
)



In [13]:
#Create retriever object from vector db
retriever=vectorstore.as_retriever()
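Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity over a tiny hand-made store; the three-dimensional vectors, the chunk labels, and the `top_k` helper are all hypothetical, and a real vector database uses an index rather than a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy store of (chunk text, embedding) pairs with made-up embeddings.
store = [
    ("chunk about Breyer", [1.0, 0.0, 0.0]),
    ("chunk about Intel", [0.0, 1.0, 0.0]),
    ("chunk about gun violence", [0.0, 0.0, 1.0]),
]
hits = top_k([0.9, 0.1, 0.0], store, k=2)
print(hits)
```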

## **Create the LangChain prompt and pass it to the LLM with the user question and retrieved output**

In [14]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

#### **Use LLM Model - gpt-3.5-turbo**

In [15]:
llm=ChatOpenAI(model_name="gpt-3.5-turbo",temperature=0.2)



#### **Langchain prompt template**

In [16]:
# Define prompt template
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

In [17]:
prompt = ChatPromptTemplate.from_template(template)


In [18]:
#Create Chain pipeline
rag_chain=(
     #The runtime question is passed through as "question" and also sent to the retriever, whose output is passed as "context"
    {"context":retriever, "question":RunnablePassthrough()}
          | prompt
          | llm
          | StrOutputParser()
)
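The chain above can be read as three plain steps: retrieve, fill the prompt, call the model. Here is a minimal sketch of that flow with stub components so it runs without any API key; `fake_retriever`, `fake_llm`, `build_prompt`, and `run_chain` are hypothetical stand-ins, not LangChain APIs, and the template is a shortened version of the one defined earlier.

```python
TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"

def fake_retriever(question: str) -> list[str]:
    """Stub retriever: always returns the same chunk."""
    return ["Justice Breyer, thank you for your service."]

def fake_llm(prompt: str) -> str:
    """Stub model: reports the prompt size instead of generating text."""
    return f"[stub answer for a prompt of {len(prompt)} characters]"

def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with the question and the joined context chunks."""
    return TEMPLATE.format(question=question, context="\n\n".join(chunks))

def run_chain(question: str, retriever=fake_retriever, llm=fake_llm) -> str:
    """Retrieve context, build the prompt, and return the model output as a string."""
    chunks = retriever(question)
    return llm(build_prompt(question, chunks))

print(run_chain("What did the President say about Justice Breyer?"))
```

Keeping the retriever callable separately matters for evaluation, because the retrieved contexts have to be recorded alongside each generated answer.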

In [55]:
#Test/Invoke LLM Model
res = rag_chain.invoke("what did the President say about Justic Breyer?")
res

"The President honored Justice Stephen Breyer for his service and mentioned that Judge Ketanji Brown Jackson will continue Justice Breyer's legacy of excellence."


# **Evaluation - RAGAS**
- Create a **list of user questions**
- Create a **list of ground_truths**

In [20]:
# List of user Q
questions=["what did the President say about Justic Breyer?",
           "What did the President say abou Intel's CEO?",
           "What did the President say about gun violence?"

           ]

In [21]:
# List of ground truths (reference answers) for the user questions
# WARNING:ragas.validation: passing column names as 'ground_truths' is deprecated
# and will be removed in the next version, please use 'ground_truth' instead.
# Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`

# If you are not interested in the context_recall metric, you don't need to provide the ground_truths information.

ground_truths=[["The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service."],
              ["The president said that Pat Gelsinger is ready to increase Intel's investment to $100 billion."],
              ["The president asked Congress to pass proven measures to reduce gun violence."]]
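The deprecation warning quoted in the comments asks for a `ground_truth` column of plain strings instead of `ground_truths` as lists of strings. A minimal sketch of that conversion follows; `flatten_ground_truths` is a hypothetical helper, and joining multiple references with a space is an assumption that only matters when an entry holds more than one string.

```python
def flatten_ground_truths(ground_truths: list[list[str]]) -> list[str]:
    """Turn one list of reference strings per question into one string per question."""
    return [" ".join(references) for references in ground_truths]

ground_truths_nested = [
    ["The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service."],
    ["The president said that Pat Gelsinger is ready to increase Intel's investment to $100 billion."],
    ["The president asked Congress to pass proven measures to reduce gun violence."],
]
ground_truth = flatten_ground_truths(ground_truths_nested)
print(ground_truth[2])
```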

In [22]:
#Create empty lists
answer=[]  # Generated answers
context=[]  # Contexts retrieved from the vector DB for each user question

In [23]:
#Loop through all questions and collect the generated answer and the retrieved context
for query in questions:
  answer.append(rag_chain.invoke(query))  # Generated answer
  context.append([doc.page_content for doc in retriever.get_relevant_documents(query)])  # Contexts retrieved from the vector DB



In [24]:
answer

["The President honored Justice Stephen Breyer for his service and mentioned nominating Judge Ketanji Brown Jackson to continue Breyer's legacy.",
 "The President mentioned that Intel's CEO, Pat Gelsinger, is ready to increase their investment from $20 billion to $100 billion, which would be one of the biggest investments in manufacturing in American history.",
 'The President called for Congress to pass measures to reduce gun violence, including universal background checks and a ban on assault weapons and high-capacity magazines. He emphasized the need to crack down on gun trafficking and ghost guns to keep communities safe.']

In [25]:
context

[['Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.',
  'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
  'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Borde

## **Create a dictionary of question, answer, context, and ground truth**
- User question,
- Generated answer,
- Context extracted from the vector DB,
- Ground truth

In [26]:
# Create To dict
data = {
    "question": questions,
    "answer": answer,
    "contexts": context,
    "ground_truths": ground_truths
}

## **Convert dictionary as Hugging face dataset format**

In [27]:
#!pip install datasets

In [28]:
#Convert dictionary as Hugging face dataset format
from datasets import Dataset
dataset=Dataset.from_dict(data)
dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 3
})

## **Apply RAG Evaluation Technique**

In [29]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

In [30]:
#This step sometimes fails; if it does, terminate the session and rerun
#Main RAG evaluation
result = evaluate(
    dataset = dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)



Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

In [51]:
#Average metric values over the 3 records
result

{'context_precision': 1.0000, 'context_recall': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.8687}

In [32]:
#Convert to pandas df
evaluation_df=result.to_pandas()
evaluation_df

| | question | answer | contexts | ground_truths | ground_truth | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | what did the President say about Justic Breyer? | The President honored Justice Stephen Breyer f... | [Tonight, I’d like to honor someone who has de... | [The president said that Justice Breyer has de... | The president said that Justice Breyer has ded... | 1.0 | 1.0 | 1.0 | 0.825979 |
| 1 | What did the President say abou Intel's CEO? | The President mentioned that Intel's CEO, Pat ... | [But that’s just the beginning. \n\nIntel’s CE... | [The president said that Pat Gelsinger is read... | The president said that Pat Gelsinger is ready... | 1.0 | 1.0 | 1.0 | 0.866591 |
| 2 | What did the President say about gun violence? | The President called for Congress to pass meas... | [And I ask Congress to pass proven measures to... | [The president asked Congress to pass proven m... | The president asked Congress to pass proven me... | 1.0 | 1.0 | 1.0 | 0.913553 |
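The aggregate `answer_relevancy` of 0.8687 reported by `result` is consistent with a plain mean of the three per-question values in the table above. A quick check, assuming simple averaging:

```python
from statistics import mean

# Per-question answer_relevancy values copied from the evaluation table.
answer_relevancy_rows = [0.825979, 0.866591, 0.913553]

overall = round(mean(answer_relevancy_rows), 4)
print(overall)
```

Looking at the per-row values rather than only the mean shows which questions pull the score down, here the first one.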


In [54]:
#import pandas as pd
#pd.set_option('display.max_colwidth', 100)
evaluation_df.iloc[[0]]

| | question | answer | contexts | ground_truths | ground_truth | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | what did the President say about Justic Breyer? | The President honored Justice Stephen Breyer for his service and mentioned nominating Judge Keta... | [Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice St... | [The president said that Justice Breyer has dedicated his life to serve the country and thanked ... | The president said that Justice Breyer has dedicated his life to serve the country and thanked h... | 1.0 | 1.0 | 1.0 | 0.825979 |


In [47]:
#data
#this is my actual User query
evaluation_df.iloc[0,:].question

'what did the President say about Justic Breyer?'

In [48]:
#predicted
#generation of the model
evaluation_df.iloc[0,:].answer

"The President honored Justice Stephen Breyer for his service and mentioned nominating Judge Ketanji Brown Jackson to continue Breyer's legacy."

In [49]:
#actual/Ground Truth
#actual answer(manually we have written this answer)(this could be llm generated)
evaluation_df.iloc[0,:].ground_truth

'The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service.'

In [50]:
#Retrieved context
#retrieval result (first context chunk)
evaluation_df.iloc[0,:].contexts[0]

'Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.'

In [None]:
"""
context_precision                                                  1.0
context_recall                                                     1.0
faithfulness                                                       1.0
answer_relevancy                                               0.83073
"""

# **Try our own LLM and embedding model for evaluation**
- By default, RAGAS uses OpenAI's gpt-3.5-turbo-16k model; we can switch to our own LLM by passing it to `evaluate`, as shown in the following cell

In [None]:
"""
from langchain_openai.chat_models import AzureChatOpenAI
from langchain_openai.embeddings import AzureOpenAIEmbeddings

azure_model = AzureChatOpenAI(
openai_api_version = "",
azure_endpoint = "",
azure_deployment = "",
model = "",
validate_base_url  = ""
)

azure_embeddings = AzureOpenAIEmbeddings(
openai_api_version = "",
azure_endpoint = "",
azure_deployment = "",
model = "",
validate_base_url  = ""
)

result = evaluate(dataset = dataset,
                  llm = azure_model,
                  embeddings = azure_embeddings,
                  #is_async = True,
                  metrics=[context_precision,
                           context_recall,
                           faithfulness,
                           answer_relevancy,
                           ],
                  )
"""

In [None]:
"""
#Now let's create a LangChain LLM instance and wrap it with the Ragas LangchainLLMWrapper class.
#Because vLLM can run in OpenAI compatibility mode, we can use the ChatOpenAI class as it is with small tweaks.

from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

inference_server_url = "http://localhost:8080/v1"

# create vLLM Langchain instance
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLMWrapper(chat)

# By default, ragas.evaluate uses OpenAI's gpt-3.5-turbo-16k, which may fail on the token limit.
# To avoid this issue we can bring our own LLM, here served through vLLM.
context_precision.llm = vllm
context_recall.llm = vllm
faithfulness.llm = vllm
answer_relevancy.llm = vllm


# WARNING:ragas.validation: passing column names as 'ground_truths' is deprecated
# and will be removed in the next version, please use 'ground_truth' instead.
# Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`


result=evaluate(
    dataset=dataset,
    #List of Evaluation metrics
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ]
)
"""

# **Test evaluation metric values on a different sample with hardcoded answers**
- answer_similarity
- answer_correctness

In [None]:
from datasets import Dataset
from ragas.metrics import answer_similarity
from ragas import evaluate


data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_similarity]) # Check only answer_similarity
score.to_pandas()

In [None]:
from datasets import Dataset
from ragas.metrics import faithfulness, answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_correctness]) # Check only answer_correctness
score.to_pandas()
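To give a feel for how `answer_correctness` can combine factual overlap with semantic similarity, here is a hedged sketch. It assumes the factual part is an F1 score over statement counts (true positives, false positives, false negatives) and that the two parts are mixed with weights of 0.75 and 0.25; both the formula and the weights are assumptions about the RAGAS default and should be checked against the installed version. The counts and the similarity value are hand-picked, and the function names are hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: tp in both answer and ground truth, fp only in answer, fn only in ground truth."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness(tp: int, fp: int, fn: int, similarity: float,
                       weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted average of factual F1 and semantic similarity (weights are an assumption)."""
    w_factual, w_semantic = weights
    factual = f1_from_counts(tp, fp, fn)
    return (w_factual * factual + w_semantic * similarity) / (w_factual + w_semantic)

# One statement matches, one is extra in the answer, one is missing from it.
score_example = answer_correctness(tp=1, fp=1, fn=1, similarity=0.9)
print(round(score_example, 3))
```

In this reading, `answer_similarity` alone corresponds to the `similarity` input, which is why an answer can score high on similarity while still losing points on correctness for a missing fact.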

# **END**