# Introduction to RAGAs


- RAGAs stands for Retrieval-Augmented Generation Assessment; it is a framework for evaluating RAG pipelines
- RAGAs leverages LLMs under the hood to conduct the evaluations


## data format

To use RAGAs, the data must be formatted with the following columns:

- `question`: The user query, i.e. the input to the RAG pipeline. These are the questions the pipeline will be evaluated on.
- `contexts`: The contexts retrieved from the external knowledge source and used to answer the question.
- `ground_truth`: The ground truth answer to the question. This is the only human-annotated information, and it is only required for the metric `context_recall` (see Evaluation Metrics).
- `answer`: The **generated** answer from the RAG pipeline, i.e. the output.
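
To make the expected layout concrete, here is a minimal sketch of a one-sample record set in this format, together with a small sanity check. The helper `validate_ragas_data` is hypothetical (it is not part of RAGAs), and all sample strings are placeholders.

```python
# Column names taken from the list above; the values below are placeholders.
REQUIRED_COLUMNS = ("question", "contexts", "answer", "ground_truth")


def validate_ragas_data(data: dict) -> int:
    """Hypothetical helper: check column presence, equal lengths, and types.

    Returns the number of samples if the layout looks valid.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {len(data[col]) for col in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")

    # each sample carries a *list* of retrieved context strings
    for ctx in data["contexts"]:
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError("each entry of 'contexts' must be a list of strings")

    return lengths.pop()


sample = {
    "question": ["placeholder question?"],
    "contexts": [["placeholder chunk 1", "placeholder chunk 2"]],
    "answer": ["placeholder generated answer"],
    "ground_truth": ["placeholder reference answer"],
}
n_samples = validate_ragas_data(sample)
```

Note that `contexts` is the only column whose entries are lists; the other three hold one string per sample.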


## evaluation metrics

### retriever

- `context_precision`: measures how relevant the retrieved context is to the question, i.e. **the quality of the retrieval pipeline**
- `context_recall`: measures the retriever's ability to retrieve all necessary information
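
A rough sketch of how these two scores can be aggregated once per-chunk and per-statement verdicts exist. In RAGAs the verdicts come from LLM calls; here they are hard-coded, hypothetical inputs, and the function names are illustrative rather than the library's API. The rank-weighted form of precision follows the definition described in the RAGAs docs, which may differ between versions.

```python
def context_precision_at_k(relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def context_recall_score(attributed: list[bool]) -> float:
    """Share of ground-truth statements that the retrieved context supports."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# Hypothetical verdicts: the chunks at rank 1 and 4 are relevant, 2 and 3 are not.
precision = context_precision_at_k([True, False, False, True])

# Hypothetical verdicts: 2 of 3 ground-truth statements are found in the context.
recall = context_recall_score([True, True, False])
```

With these inputs, precision is (1/1 + 2/4) / 2 = 0.75 and recall is 2/3. Relevant chunks that appear late in the ranking pull precision down, which is why the order of retrieved chunks matters.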

### generator

- `faithfulness`: measures the factual consistency of the generated answer with the retrieved context
- `answer_relevancy`: measures how relevant the generated answer is to the question
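
A simplified sketch of the arithmetic behind these two scores. It assumes the LLM-dependent steps have already happened: splitting the answer into claims and checking each against the context (faithfulness), and generating questions back from the answer and embedding them (answer relevancy). The function names, verdicts, and toy 2-d vectors are all hypothetical.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(
    question_vec: list[float], generated_question_vecs: list[list[float]]
) -> float:
    """Mean similarity between the real question and questions derived from the answer."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical: one of two claims in the answer is backed by the context.
faith = faithfulness_score([True, False])

# Hypothetical toy embeddings: one generated question matches, one is orthogonal.
relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Both toy inputs evaluate to 0.5. Neither score looks at `ground_truth`; they only relate the answer to the context and to the question.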


# load dependencies


In [9]:
import os
import warnings
from pathlib import Path

import openai
import requests
import weaviate
from datasets import Dataset
from dotenv import find_dotenv, load_dotenv
from langchain_community.document_loaders import TextLoader

from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Weaviate
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    faithfulness,
)
from weaviate.embedded import EmbeddedOptions

warnings.filterwarnings("ignore")
_ = load_dotenv(find_dotenv("../.env"))
openai.api_key = os.getenv("OPENAI_API_KEY")

# load data (raw text)


In [14]:
data_path = Path("../data/llm_eval/state_of_the_union.txt")

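if not data_path.exists():
    # download the raw text once and cache it locally
    url = "https://raw.githubusercontent.com/langchain-ai/langchainjs/main/examples/state_of_the_union.txt"
    res = requests.get(url)
    res.raise_for_status()

    data_path.parent.mkdir(parents=True, exist_ok=True)
    with open(data_path, "w", encoding="utf-8") as f:
        f.write(res.text)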

In [15]:
# load the data
loader = TextLoader(data_path)
documents = loader.load()

In [16]:
# chunk the data

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
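
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal character-window sketch. It is not LangChain's implementation: `CharacterTextSplitter` first splits on a separator (`"\n\n"` by default) and then merges pieces up to the size limit, so its chunks follow paragraph boundaries. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# -> ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk border still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.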

In [19]:
print(f"total {len(chunks)} chunks")

total 90 chunks


In [20]:
# index the data

# Setup vector database
client = weaviate.Client(embedded_options=EmbeddedOptions())

# populate vector database
vectorstore = Weaviate.from_documents(
    client=client, documents=chunks, embedding=OpenAIEmbeddings(), by_text=False
)

# Define vectorstore as retriever to enable semantic search
retriever = vectorstore.as_retriever()
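
Conceptually, semantic search ranks the stored chunks by how close their embeddings are to the query embedding. The sketch below shows that ranking step with toy 2-d vectors; the chunk texts, vectors, and the `top_k_chunks` helper are made up for illustration and say nothing about how Weaviate does this internally.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def top_k_chunks(
    query_vec: list[float], indexed_chunks: list[tuple[str, list[float]]], k: int = 2
) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    scored = ((cosine(query_vec, vec), text) for text, vec in indexed_chunks)
    return [text for _, text in heapq.nlargest(k, scored)]


# Hypothetical index: (chunk text, toy embedding)
toy_index = [
    ("chunk about courts", [0.9, 0.1]),
    ("chunk about chips", [0.1, 0.9]),
    ("chunk about judges", [0.8, 0.3]),
]
hits = top_k_chunks([1.0, 0.0], toy_index, k=2)
# -> ['chunk about courts', 'chunk about judges']
```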


# establish a template & setup rag chain


In [21]:
# define llm
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# Define prompt template
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

prompt = ChatPromptTemplate.from_template(template)

# setup rag pipeline
# 'RunnablePassthrough' allows data to be passed through without modification
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
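
For readers new to the pipe syntax, the chain can be read as four plain steps: fan the question out into `context` and `question`, fill the prompt, call the model, return text. The sketch below mirrors that flow with stand-in functions; `run_rag`, `fake_retrieve`, and `fake_llm` are hypothetical stubs, not LangChain objects, and the template is shortened.

```python
TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"


def run_rag(question: str, retrieve, generate) -> str:
    # step 1: the dict at the head of the chain feeds the same input to both keys
    inputs = {"context": retrieve(question), "question": question}
    # step 2: fill the prompt template
    prompt = TEMPLATE.format(**inputs)
    # steps 3 and 4: call the model and return its output as a plain string
    return str(generate(prompt))


seen_prompts = []  # records what the stub model receives


def fake_retrieve(question: str) -> list[str]:
    return ["stub chunk A", "stub chunk B"]


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "stub answer"


answer = run_rag("What is RAG?", fake_retrieve, fake_llm)
```

Inspecting `seen_prompts[0]` shows the exact text the model would receive, which is handy when debugging why an answer ignores the context.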

# create dataset


In [36]:
questions = [
    "What did the president say about Justice Breyer?",
    "What did the president say about Intel's CEO?",
    "What did the president say about gun violence?",
]
ground_truth = [
    "The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service.",
    "The president said that Pat Gelsinger is ready to increase Intel's investment to $100 billion.",
    "The president asked Congress to pass proven measures to reduce gun violence.",
]
answers = []
contexts = []

# inference
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append(
        [doc.page_content for doc in retriever.get_relevant_documents(query)]
    )


data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truth,
}

dataset = Dataset.from_dict(data)

# evaluate


In [37]:
# select metrics
metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
]

# evaluate
result = evaluate(
    dataset=dataset,
    metrics=metrics,
)


In [38]:
result.to_pandas()

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_relevancy | context_recall | context_precision |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What did the president say about Justice Breyer? | The president honored Justice Stephen Breyer f... | [Tonight, I’d like to honor someone who has de... | The president said that Justice Breyer has ded... | 1.0 | 0.812966 | 0.066667 | 1.0 | 1.0 |
| 1 | What did the president say about Intel's CEO? | The president mentioned that Intel's CEO, Pat ... | [But that’s just the beginning. \n\nIntel’s CE... | The president said that Pat Gelsinger is ready... | 0.5 | 0.806654 | 0.038462 | 1.0 | 1.0 |
| 2 | What did the president say about gun violence? | The president called for Congress to pass meas... | [And I ask Congress to pass proven measures to... | The president asked Congress to pass proven me... | 1.0 | 0.908946 | 0.35 | 1.0 | 0.75 |
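
The per-question scores are easier to compare once averaged per metric. A small sketch using the values from the results above, copied by hand; the `scores` dict and variable names are illustrative only.

```python
import statistics

# Per-question scores copied from the results of this run
scores = {
    "faithfulness": [1.0, 0.5, 1.0],
    "answer_relevancy": [0.812966, 0.806654, 0.908946],
    "context_relevancy": [0.066667, 0.038462, 0.35],
    "context_recall": [1.0, 1.0, 1.0],
    "context_precision": [1.0, 1.0, 0.75],
}

# average each metric over the three questions
means = {name: statistics.mean(values) for name, values in scores.items()}

# the metric with the lowest average is the first candidate for a closer look
weakest = min(means, key=means.get)

for name, value in means.items():
    print(f"{name:>18}: {value:.3f}")
print(f"weakest metric: {weakest}")
```

With only three questions these averages are noisy, so they are better read as a pointer to where to look than as a benchmark.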




# Reference

- [Evaluating RAG Applications with RAGAs](https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a)
- [RAGAs official documentation](https://docs.ragas.io/en/latest/getstarted/evaluation.html)
