In [1]:
# !pip install -q ragas

## Simple Evaluation of RAG with Ragas

### Resources
1. https://bishalbose294.medium.com/demystifying-ragas-a-deep-dive-into-evaluating-retrieval-augmented-generation-pipelines-part-3-e6ba7541543c
2. https://docs.ragas.io/en/v0.1.21/concepts/index.html
3. https://namratanwani.medium.com/evaluate-rag-with-ragas-e1ad1aa99c2e


In [3]:
import os
import openai
from ragas import evaluate
from datasets import Dataset 
from dotenv import load_dotenv

from langchain_openai import ChatOpenAI

from langchain_google_genai import ChatGoogleGenerativeAI

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from langchain_community.vectorstores import Chroma
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import SeleniumURLLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_entity_recall,
    answer_similarity,
    answer_correctness,
)


In [5]:
# !pip install -q selenium



In [7]:
# !pip install -q unstructured



In [8]:
load_dotenv()


urls = [
    "https://en.wikipedia.org/wiki/New_York_City",
    "https://en.wikipedia.org/wiki/Snow_leopard",
    "https://www.britannica.com/place/Galapagos-Islands",
    "https://www.birdlife.org/birds/penguins/#:~:text=The%20threats%20are%20numerous%2C%20including,is%20melting%20before%20their%20eyes."
]

# collect data using selenium url loader
loader = SeleniumURLLoader(urls=urls)
documents = loader.load()

The loader returns a list of Document objects. For preprocessing, we extract each document's text, replace newlines and tabs with spaces, and collect the results as a list of strings.

In [9]:
documentList = []
for doc in documents:
    d = str(doc.page_content).replace("\\n", " ").replace("\\t"," ").replace("\n", " ").replace("\t", " ")
    documentList.append(d)
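The chained `replace` calls above leave runs of consecutive spaces wherever several newlines or tabs were adjacent. As a possible variation (not what the cell above does), a regular expression can collapse those runs as well. The helper name `normalize_whitespace` is our own, introduced only for illustration.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Turn escaped and real newlines/tabs into single spaces."""
    # Scraped pages sometimes contain the two-character sequences
    # backslash+n and backslash+t, so handle those first.
    text = text.replace("\\n", " ").replace("\\t", " ")
    # \s+ matches real newlines, tabs, and repeated spaces in one pass.
    return re.sub(r"\s+", " ", text).strip()

sample = "New York\\nCity\n\n\tis  large."
print(normalize_whitespace(sample))
```

Unlike the original loop, this also trims leading and trailing whitespace, which should not matter for chunking but keeps the stored strings tidy.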

We now perform semantic chunking, which splits each document into chunks based on the semantic similarity between sentences: neighbouring sentences stay in the same chunk as long as they are semantically similar. Measuring that similarity requires dense sentence embeddings, so we use the sentence-transformers all-MiniLM-L6-v2 model to generate them.

The chunked documents are then stored in a Chroma vector store, persisted in a folder called chroma_db. We use the same sentence-transformers model to embed the chunks for storage.

In [10]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
text_splitter = SemanticChunker(embedding_function)
docs = text_splitter.create_documents(documentList)

# storing embeddings in a folder
vector_store = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
# use this to load vector database
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
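To make the idea behind semantic chunking concrete, here is a minimal sketch under loud assumptions: a toy bag-of-words vector stands in for the all-MiniLM-L6-v2 embedding, and a fixed similarity threshold of 0.2 decides where a chunk ends. `SemanticChunker` chooses its breakpoints with its own rule, so this only illustrates the principle and is not the library's implementation. The names `embed`, `cosine_similarity`, and `semantic_chunks` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector (a stand-in for a dense sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_similarity(embed(prev), embed(cur)) >= threshold:
            chunks[-1].append(cur)   # similar enough: extend the current chunk
        else:
            chunks.append([cur])     # topic shift: open a new chunk
    return chunks

sentences = [
    "Snow leopards live in high mountains.",
    "Snow leopards hunt in the mountains at dusk.",
    "Penguins depend on sea ice.",
    "Sea ice is melting quickly.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these four sentences the split falls between the snow leopard pair and the sea ice pair, because the second and third sentences share no words.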




In [11]:
PROMPT_TEMPLATE = """
Go through the context and answer given question strictly based on context. 
Context: {context}
Question: {question}
Answer:
"""
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    # retriever=vector_store.as_retriever(search_kwargs={'k': 3}),
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PromptTemplate.from_template(PROMPT_TEMPLATE)},
)
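To see roughly what the LLM receives, the sketch below fills the same template by hand. It assumes the retrieved chunks are joined with a blank line before being substituted for `{context}`. The helper `build_prompt` and the two example chunks are made up for illustration and are not part of the chain.

```python
PROMPT_TEMPLATE = """
Go through the context and answer given question strictly based on context.
Context: {context}
Question: {question}
Answer:
"""

def build_prompt(chunks: list[str], question: str) -> str:
    # Assumption: chunks are concatenated with a blank line between them.
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks, standing in for the retriever's output.
retrieved = [
    "Penguins live on the Galapagos Islands.",
    "The islands lie on the Equator.",
]
prompt = build_prompt(retrieved, "Are penguins found in the Galapagos Islands?")
print(prompt)
```

Printing the filled prompt like this is a cheap way to check that the context is not empty or truncated before spending evaluation calls on it.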


We define our queries and ground truths for evaluation.

In [12]:
queries = [
    "Who discovered the Galapagos Islands and how?",
    "What is Brooklyn–Battery Tunnel?",
    "Are Penguins found in the Galapagos Islands?",
    "How many languages are spoken in New York?",
    "In which countries are snow leopards found?",
    "What are the threats to penguin populations?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "How did Galapagos Islands get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
    
]

ground_truths = [
    "The Galapagos Islands were discovered in 1535 by the bishop of Panama, Tomás de Berlanga, whose ship had drifted off course while en route to Peru. He named them Las Encantadas (“The Enchanted”), and in his writings he marveled at the thousands of large galápagos (tortoises) found there. Numerous Spanish voyagers stopped at the islands from the 16th century, and the Galapagos also came to be used by pirates and by whale and seal hunters. ",
    "The Brooklyn-Battery Tunnel (officially known as the Hugh L. Carey Tunnel) is the longest continuous underwater vehicular tunnel in North America and runs underneath Battery Park, connecting the Financial District in Lower Manhattan to Red Hook in Brooklyn.",
    "Penguins live on the galapagos islands side by side with tropical animals.",
    "As many as 800 languages are spoken in New York.",
    "Siberia, Tajikistan, Kyrgyzstan, Uzbekistan, Kazakhstan, Afghanistan, Pakistan, India, Nepal, Bhutan, Mongolia, and Tibet.",
    "The threats are numerous, including habitat loss, pollution, disease, and reduced food availability due to commercial fishing. Climate change is of particular concern for many species of penguin, as the sea ice that they depend on to find food or build nests is melting before their eyes.",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "Tomás de Berlanga, who discovered the islands, named them Las Encantadas (“The Enchanted”), and in his writings he marveled at the thousands of large galápagos (tortoises) found there. Numerous Spanish voyagers stopped at the islands from the 16th century, and the Galapagos also came to be used by pirates and by whale and seal hunters.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
    
]

Finally, we run each query through the chain, storing the generated answer together with the retrieved chunks that were supplied to the LLM as context.

In [13]:
results = []
contexts = []
for query in queries:
    # .invoke replaces the deprecated direct call qa_chain({...})
    result = qa_chain.invoke({"query": query})
    results.append(result["result"])
    # keep the text of every retrieved chunk as this query's context
    contexts.append([doc.page_content for doc in result["source_documents"]])



The evaluation dataset is built from a dictionary with one entry per column (question, answer, contexts, ground_truth), and Ragas computes every metric for each query. The per-query scores are then saved to a CSV file.

In [14]:

d = {
    "question": queries,
    "answer": results,
    "contexts": contexts,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(d)
score = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_entity_recall,
        answer_similarity,
        answer_correctness,
    ],
)
score_df = score.to_pandas()
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)
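Because every evaluation run makes many LLM calls, it can be worth checking the column dictionary before it is passed to `Dataset.from_dict`. The following is a minimal sketch of such a check. The function `validate_eval_columns` is hypothetical and not part of Ragas or datasets. It only verifies that the four columns exist, have the same length, and that each contexts entry is a list of strings.

```python
def validate_eval_columns(columns: dict) -> int:
    """Check that the evaluation columns are aligned; return the row count."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for i, ctx in enumerate(columns["contexts"]):
        # Each row needs a list of chunk texts, not a single string.
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return lengths["question"]

# Hypothetical one-row example in the same shape as the dictionary d above.
example = {
    "question": ["How many languages are spoken in New York?"],
    "answer": ["As many as 800."],
    "contexts": [["As many as 800 languages are spoken in New York."]],
    "ground_truth": ["As many as 800 languages are spoken in New York."],
}
print(validate_eval_columns(example))
```

In this notebook the call would be `validate_eval_columns(d)`, placed just before the dataset is created.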


In [15]:
score_df[['faithfulness','answer_relevancy', 'context_precision', 'context_recall',
       'context_entity_recall', 'answer_similarity', 'answer_correctness']].mean(axis=0)

faithfulness             0.966667
answer_relevancy         0.985395
context_precision        0.708333
context_recall           0.866667
context_entity_recall    0.226071
answer_similarity        0.945248
answer_correctness       0.767146
dtype: float64

The RAG pipeline scores highly on faithfulness (0.97), answer relevancy (0.99), answer similarity (0.95), and context recall (0.87). The weak spots are context entity recall (0.23), context precision (0.71), and answer correctness (0.77); these are the areas to improve in order to raise the overall quality and accuracy of the generated responses.
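Averages hide which queries drag a metric down. A minimal sketch for listing the weakest rows of one metric follows. It works on a list of dictionaries such as `score_df.to_dict("records")` would produce. The helper `weakest_rows` is hypothetical, and the numbers in the example rows are invented placeholders, not results from this notebook.

```python
import math

def weakest_rows(rows: list[dict], metric: str, n: int = 3) -> list[dict]:
    """Return the n rows with the lowest value for the given metric."""
    # Skip rows where the metric is missing or NaN (a failed evaluation job).
    scored = [
        r for r in rows
        if isinstance(r.get(metric), (int, float)) and not math.isnan(r[metric])
    ]
    return sorted(scored, key=lambda r: r[metric])[:n]

# Invented placeholder rows, only to show the shape of the data.
rows = [
    {"question": "q1", "context_entity_recall": 0.50},
    {"question": "q2", "context_entity_recall": 0.10},
    {"question": "q3", "context_entity_recall": float("nan")},
    {"question": "q4", "context_entity_recall": 0.25},
]
for row in weakest_rows(rows, "context_entity_recall", n=2):
    print(row["question"], row["context_entity_recall"])
```

Reading the retrieved contexts for the lowest-scoring questions is usually the quickest way to tell whether the retriever or the ground truth is responsible for a low score.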