# 🧠🔍 RAGAS: Evaluating RAG Pipelines on Kubernetes Documentation 🚀

Welcome to this notebook! Here, we explore **Retrieval-Augmented Generation (RAG)** pipelines using the [RAGAS](https://github.com/explodinggradients/ragas) framework, focusing on the Kubernetes documentation.  
You'll learn how to:

- 📚 Ingest and process real-world technical docs
- 🤖 Build and evaluate RAG pipelines for question answering
- 📝 Analyze the quality of generated answers using RAGAS metrics

Let's dive into building robust, trustworthy QA systems with state-of-the-art tools! 🌟

# Libraries + API keys

In [None]:
!pip install --upgrade --quiet unstructured
!pip install --quiet langchain-community langchain-openai faiss-cpu
!pip install --quiet ragas

In [3]:
# URLs of the Kubernetes documentation pages to ingest
urls = [
    "https://kubernetes.io/docs/concepts/overview/",
    "https://kubernetes.io/docs/concepts/architecture/",
    "https://kubernetes.io/docs/concepts/containers/",
    "https://kubernetes.io/docs/concepts/workloads/",
    "https://kubernetes.io/docs/concepts/storage/"
]

In [4]:
# download NLTK data
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [5]:
from google.colab import userdata
openai_api = userdata.get('OPENAI_API_KEY')

import os
os.environ['OPENAI_API_KEY'] = openai_api

# Data + RAG

In [6]:
# Import required functions / classes
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS


In [7]:
# Load the documents and split them
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=300,
)
loader = UnstructuredURLLoader(urls=urls)
documents = loader.load_and_split(text_splitter)
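
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. The helper `sliding_chunks` and the toy string are hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Toy example (hypothetical data): 20 characters, windows of 8 with 2 shared
toy_text = "abcdefghij" * 2
toy_chunks = sliding_chunks(toy_text, chunk_size=8, chunk_overlap=2)
print(toy_chunks)
```

With `chunk_size=3000` and `chunk_overlap=300`, as used in this notebook, neighbouring chunks share roughly 10% of their characters, which helps keep a sentence that straddles a boundary retrievable from either side.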

In [8]:
# Embed the chunks
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents, embeddings)

In [9]:
def rag(query, k = 5):
  # Retrieve the k chunks most similar to the query
  retrieved_docs = db.similarity_search(query, k = k)

  # Combine the text of the retrieved docs into a single context string
  combined_text = "\n\n".join([doc.page_content for doc in retrieved_docs])

  # Define the prompt
  prompt = f""" based on this context: {combined_text} answer the query: {query}"""

  # Call the LLM model
  model = ChatOpenAI(model = "gpt-4.1-mini",
                    temperature = 0)
  response_text = model.invoke(prompt)

  # Return the answer and the list of retrieved contexts
  return response_text.content, [txt.page_content for txt in retrieved_docs]

# Test the function
query = "What is the overview of kubernetes?"
rag(query)

("Kubernetes is a portable, extensible, open-source platform designed for managing containerized workloads and services. It facilitates declarative configuration and automation, enabling users to efficiently run distributed systems resiliently. Originating from Google's 15+ years of experience with production workloads, Kubernetes was open-sourced in 2014 and has since developed a large and rapidly growing ecosystem with extensive services, support, and tools.\n\nKubernetes helps manage containers in production environments by automating tasks such as scaling, failover, deployment patterns (e.g., canary deployments), service discovery, load balancing, storage orchestration, automated rollouts and rollbacks, resource optimization (automatic bin packing), self-healing, and secret/configuration management.\n\nA Kubernetes cluster consists of a control plane that manages the cluster and worker nodes that run containerized applications (Pods). The control plane includes components like the 

# Generate Synthetic data

In [10]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

In [11]:
# LLM and Embeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model = "gpt-4.1-mini",
                                               temperature = 0))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [None]:
# Test Set Generator
generator = TestsetGenerator(llm = generator_llm,
                             embedding_model = generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size = 30)

In [13]:
# Put the test data as a pandas df
test_df = dataset.to_pandas()
test_df.head()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What are the key features and benefits of usin... | [Overview\n\nKubernetes is a portable, extensi... | Kubernetes is a portable, extensible, open sou... | single_hop_specifc_query_synthesizer |
| 1 | what kubernetes not do for cluster admin? | [Batch execution In addition to services, Kube... | Kubernetes is not a traditional all-inclusive ... | single_hop_specifc_query_synthesizer |
| 2 | As a Kubernetes Infrastructure Engineer, how d... | [Does not provide nor adopt any comprehensive ... | Kubernetes differs from traditional orchestrat... | single_hop_specifc_query_synthesizer |
| 3 | why containers good for cluster admin like me ... | [Containers have become popular because they p... | Containers are popular because they provide ma... | single_hop_specifc_query_synthesizer |
| 4 | What are the main components of the Kubernetes... | [Cluster Architecture\n\nThe architectural con... | The Kubernetes control plane consists of sever... | single_hop_specifc_query_synthesizer |


In [14]:
# Answer each user_input with our RAG and put it in the df
answers_gen = []
context_gen = []
for user_input in test_df['user_input']:
  answer, context = rag(user_input)
  answers_gen.append(answer)
  context_gen.append(context)

# Store it in the pandas df
test_df['answer'] = answers_gen
test_df['context'] = context_gen

In [15]:
# Preview
test_df.head()

| | user_input | reference_contexts | reference | synthesizer_name | answer | context |
|---|---|---|---|---|---|---|
| 0 | What are the key features and benefits of usin... | [Overview\n\nKubernetes is a portable, extensi... | Kubernetes is a portable, extensible, open sou... | single_hop_specifc_query_synthesizer | The key features and benefits of using Kuberne... | [Overview\n\nKubernetes is a portable, extensi... |
| 1 | what kubernetes not do for cluster admin? | [Batch execution In addition to services, Kube... | Kubernetes is not a traditional all-inclusive ... | single_hop_specifc_query_synthesizer | Based on the provided context, **Kubernetes do... | [Batch execution In addition to services, Kube... |
| 2 | As a Kubernetes Infrastructure Engineer, how d... | [Does not provide nor adopt any comprehensive ... | Kubernetes differs from traditional orchestrat... | single_hop_specifc_query_synthesizer | As a Kubernetes Infrastructure Engineer, you w... | [Does not provide nor adopt any comprehensive ... |
| 3 | why containers good for cluster admin like me ... | [Containers have become popular because they p... | Containers are popular because they provide ma... | single_hop_specifc_query_synthesizer | Great question! As a cluster administrator man... | [Containers have become popular because they p... |
| 4 | What are the main components of the Kubernetes... | [Cluster Architecture\n\nThe architectural con... | The Kubernetes control plane consists of sever... | single_hop_specifc_query_synthesizer | The main components of the Kubernetes control ... | [Cluster Architecture\n\nThe architectural con... |


# Rouge Score

In [16]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


In [17]:
# Import classes and libraries
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore
import numpy as np

https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/traditional/#rouge-score

In [19]:
# Rouge Score
# Create the scorer once and reuse it for every row
scorer = RougeScore()

# Get the rouge score for each test set row
rouge_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer']
  )

  # Compute the rouge score of the response against the reference
  rouge_score = await scorer.single_turn_ascore(sample)

  rouge_score_list.append(rouge_score)

In [20]:
# Print the output
print(rouge_score_list)
print(f"The mean rouge score is {np.mean(rouge_score_list)}")

[0.23448275862068965, 0.164, 0.18725099601593628, 0.17423312883435582, 0.454054054054054, 0.166351606805293, 0.3050847457627119, 0.0978723404255319, 0.3496932515337423, 0.2612244897959184, 0.18446601941747573, 0.14222222222222222, 0.25517241379310346, 0.21508828250401285, 0.4088669950738916]
The mean rouge score is 0.24000422032392926
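
To see what this number measures, here is a minimal sketch of a ROUGE-L F-measure computed by hand. It assumes that `RougeScore` defaults to ROUGE-L in F-measure mode, and it uses plain lowercase whitespace tokens, whereas the `rouge_score` package applies its own tokenization, so the values will not match exactly. The helpers `lcs_length` and `rouge_l_f1` and the toy sentences are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, response: str) -> float:
    """ROUGE-L F-measure over lowercase whitespace tokens (simplified)."""
    ref_tokens = reference.lower().split()
    res_tokens = response.lower().split()
    if not ref_tokens or not res_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, res_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(res_tokens)  # share of the response that matches
    recall = lcs / len(ref_tokens)     # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)

# Toy example (hypothetical sentences): the response covers the whole
# reference but is much longer, so precision pulls the score down.
toy_reference = "kubernetes manages containerized workloads"
toy_response = "kubernetes is a platform that manages containerized workloads and services"
toy_rouge = rouge_l_f1(toy_reference, toy_response)
print(round(toy_rouge, 3))
```

In the toy case recall is 1.0 but precision is only 0.4. The same effect may partly explain the low mean above: the RAG answers are long and structured, while the synthetic references are short.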


# LLM based evaluation

In [21]:
# Define evaluation llm and embedding model
eval_llm = LangchainLLMWrapper(ChatOpenAI(model = "gpt-4.1",
                                          temperature = 0))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

## Simple Scoring

https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#simple-criteria-scoring

In [22]:
from ragas.metrics import SimpleCriteriaScore

In [23]:
# Define the simple scorer
simple_scorer = SimpleCriteriaScore(
    name = "simple scorer",
    definition = "Score 0 to 5 by similarity",
    llm = eval_llm
)

In [24]:
# Get the simple score for every row
simple_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'])
  simple_score = await simple_scorer.single_turn_ascore(sample)
  simple_score_list.append(simple_score)

# Print the outputs
print(simple_score_list)
print(f"The mean simple score is {np.mean(simple_score_list)}")

[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
The mean simple score is 5.0


In [25]:
# inspect the data
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'])
  print(sample)

user_input='What are the key features and benefits of using Kubernetes for managing containerized workloads in a production environment?' retrieved_contexts=None reference_contexts=None response='The key features and benefits of using Kubernetes for managing containerized workloads in a production environment include:\n\n### Key Features of Kubernetes:\n\n1. **Service Discovery and Load Balancing**  \n   Kubernetes can expose containers using DNS names or IP addresses and automatically load balance network traffic to ensure stable deployments under high traffic.\n\n2. **Storage Orchestration**  \n   It allows automatic mounting of storage systems, whether local storage or cloud-based, to containers as needed.\n\n3. **Automated Rollouts and Rollbacks**  \n   Kubernetes manages the desired state of container deployments, enabling controlled updates, rollouts, and rollbacks without downtime.\n\n4. **Automatic Bin Packing**  \n   Kubernetes efficiently schedules containers onto nodes based

## Rubrics Scoring

In [26]:
rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}

In [27]:
from ragas.metrics import RubricsScore

In [28]:
# Define the rubrics scorer
rubrics_scorer = RubricsScore(
    rubrics = rubrics,
    llm = eval_llm
)

In [29]:
# Compute the rubrics score for each row
rubrics_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'])
  rubrics_score = await rubrics_scorer.single_turn_ascore(sample)
  rubrics_score_list.append(rubrics_score)

# Print the outputs
print(rubrics_score_list)
print(f"The mean rubrics score is {np.mean(rubrics_score_list)}")


[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5]
The mean rubrics score is 4.933333333333334


# RAG-Specific Metrics

## Factual Correctness

In [30]:
from ragas.metrics import FactualCorrectness

In [31]:
# Define the scorer
factual_scorer = FactualCorrectness(
    llm = eval_llm
)

In [32]:
# Compute the factual correctness for each row
factual_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'])
  factual_score = await factual_scorer.single_turn_ascore(sample)
  factual_score_list.append(factual_score)

# Print the outputs
print(factual_score_list)
print(f"The mean factual score is {np.mean(factual_score_list)}")


[np.float64(0.55), np.float64(0.58), np.float64(0.39), np.float64(0.5), np.float64(0.44), np.float64(0.51), np.float64(0.57), np.float64(0.2), np.float64(0.67), np.float64(0.52), np.float64(0.7), np.float64(0.13), np.float64(0.39), np.float64(0.55), np.float64(0.47)]
The mean factual score is 0.4779999999999999
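
As a rough guide to reading these values: as I understand the Ragas documentation, `FactualCorrectness` splits the response and the reference into claims, checks them against each other with the LLM, and by default reports an F1 score over those claims. The sketch below only shows the final arithmetic. The function `claim_f1` and the counts are hypothetical and do not come from this run.

```python
def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims.

    tp: response claims supported by the reference
    fp: response claims not supported by the reference
    fn: reference claims missing from the response
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: a long answer with 10 claims, 4 of them backed by a
# reference that itself contains 5 claims.
toy_f1 = claim_f1(tp=4, fp=6, fn=1)
print(round(toy_f1, 2))
```

A detailed answer that adds correct facts absent from the reference still collects false positives under this definition, which is one way a response can receive a rubric score of 5 and a factual correctness near 0.5.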


## Semantic Similarity

In [33]:
from ragas.metrics import SemanticSimilarity

In [34]:
# Get the semantics evaluator
semantic_scorer = SemanticSimilarity(
    embeddings = eval_embeddings
)

In [35]:
# Compute the similarity for each row
semantic_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'])
  semantic_score = await semantic_scorer.single_turn_ascore(sample)
  semantic_score_list.append(semantic_score)

# Print the outputs
print(semantic_score_list)
print(f"The mean semantic score is {np.mean(semantic_score_list)}")

[0.9327519968948687, 0.917159620743325, 0.9393648944019264, 0.9364604954004442, 0.9579595563980161, 0.9155834224899974, 0.917389274595491, 0.9213169802381165, 0.9690184376434936, 0.9650888033189169, 0.9374523515114948, 0.9435436076101564, 0.9698587816728423, 0.934545579092096, 0.9635641721143208]
The mean semantic score is 0.9414038649417004
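
`SemanticSimilarity` compares the embedding of the response with the embedding of the reference using cosine similarity. The sketch below shows that computation on toy vectors in plain Python. The function `cosine_similarity` and the vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical): same direction, then orthogonal
toy_same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
toy_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(toy_same, toy_orthogonal)
```

Because the metric depends only on direction in embedding space, two texts about the same topic tend to score high even when their wording and length differ, which is consistent with the narrow 0.92 to 0.97 range above.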


## Context Precision

In [36]:
from ragas.metrics import LLMContextPrecisionWithReference

In [37]:
# Get the context scorer
context_scorer = LLMContextPrecisionWithReference(
    llm = eval_llm
)

In [38]:
test_df.head(1)

| | user_input | reference_contexts | reference | synthesizer_name | answer | context |
|---|---|---|---|---|---|---|
| 0 | What are the key features and benefits of usin... | [Overview\n\nKubernetes is a portable, extensi... | Kubernetes is a portable, extensible, open sou... | single_hop_specifc_query_synthesizer | The key features and benefits of using Kuberne... | [Overview\n\nKubernetes is a portable, extensi... |


In [40]:
# Compute the context precision for each row
context_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'],
      retrieved_contexts = row['context'])
  context_score = await context_scorer.single_turn_ascore(sample)
  context_score_list.append(context_score)

# Print the results
print(context_score_list)
print(f"The mean context score is {np.mean(context_score_list)}")

[0.94999999997625, 0.7499999999625, 0.9999999999666667, 0.9166666666361111, 0.9999999999, 0.9999999999, 0.99999999995, 0.999999999975, 0.99999999995, 0.9999999999, 0.99999999995, 0.9999999999, 0.9999999999, 0.5333333333155555, 0.9999999999]
The mean context score is 0.943333333272139
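
Context precision rewards rankings that place relevant chunks first. As documented by Ragas, the score is the mean of precision@k taken over the ranks k that hold a relevant chunk. The sketch below reproduces that arithmetic. The function `context_precision` and the verdict lists are hypothetical: the notebook does not show the per-chunk verdicts, so the patterns are only examples that happen to yield values like those in the output.

```python
def context_precision(verdicts: list[int]) -> float:
    """Mean precision@k over the ranks whose chunk was judged relevant.

    verdicts[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    """
    hits = 0
    weighted_sum = 0.0
    for k, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            weighted_sum += hits / k  # precision@k at a relevant rank
    return weighted_sum / hits if hits else 0.0

# Hypothetical verdicts for k = 5 retrieved chunks
toy_late_miss = context_precision([1, 1, 1, 0, 1])  # one irrelevant chunk at rank 4
toy_sparse = context_precision([1, 0, 0, 1, 0])     # relevant chunks at ranks 1 and 4
toy_bad_top = context_precision([0, 1, 0, 1, 1])    # irrelevant chunk ranked first
print(toy_late_miss, toy_sparse, toy_bad_top)
```

The values `0.9999999999` rather than `1.0` in the output appear to come from a small constant that Ragas adds to the denominator to avoid division by zero.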


## Context Recall

In [41]:
from ragas.metrics import LLMContextRecall

In [42]:
# Define the evaluator for recall
context_recall_scorer = LLMContextRecall(
    llm = eval_llm
)

In [43]:
# Compute the recall score for each row
context_recall_score_list = []
for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'],
      retrieved_contexts = row['context'])
  context_recall_score = await context_recall_scorer.single_turn_ascore(sample)
  context_recall_score_list.append(context_recall_score)

# Print the results
print(context_recall_score_list)
print(f"The mean context recall score is {np.mean(context_recall_score_list)}")


[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
The mean context recall score is 1.0
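
Context recall asks the opposite question from precision: how much of the reference answer can be traced back to the retrieved chunks. As I understand it, `LLMContextRecall` breaks the reference into statements, has the LLM judge each one as attributable to the retrieved context or not, and reports the attributable fraction. The function `context_recall` and the flags below are hypothetical and only show the final ratio.

```python
def context_recall(attributed: list[bool]) -> float:
    """Fraction of reference statements supported by the retrieved context."""
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)

# Hypothetical judgments for a reference made of four statements,
# one of which is not covered by any retrieved chunk.
toy_recall = context_recall([True, True, False, True])
print(toy_recall)
```

A perfect score on every row may partly reflect the setup: the references were synthesized from the same documents that were indexed, and `k = 5` chunks of up to 3000 characters leave a lot of room for the source passage to be retrieved.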


## Response Relevancy

In [44]:
from ragas.metrics import ResponseRelevancy

In [45]:
# Define the response relevancy scorer
response_relevancy_scorer = ResponseRelevancy(
    llm = eval_llm,
    embeddings = eval_embeddings
)

In [46]:
# Compute the Response relevancy
response_relevancy_score_list = []

for row in test_df.to_dict(orient = "records"):
  sample = SingleTurnSample(
      user_input = row['user_input'],
      reference = row['reference'],
      response = row['answer'],
      retrieved_contexts = row['context'])
  response_relevancy_score = await response_relevancy_scorer.single_turn_ascore(sample)
  response_relevancy_score_list.append(response_relevancy_score)

# Print the results
print(response_relevancy_score_list)
print(f"The mean response relevancy score is {np.mean(response_relevancy_score_list)}")

[np.float64(0.9999990797323609), np.float64(0.9320783580467125), np.float64(0.9563171693352789), np.float64(0.9446127899111497), np.float64(0.99299322327802), np.float64(0.9684659660137811), np.float64(0.9141063678379585), np.float64(0.9291285164381246), np.float64(0.9999981145537218), np.float64(0.9999999999999996), np.float64(0.9911334292504801), np.float64(0.9565770074907638), np.float64(0.992320066797182), np.float64(0.9659813553929811), np.float64(0.9918188353661104)]
The mean response relevancy score is 0.969035351962975
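
To interpret this metric: per the Ragas documentation, `ResponseRelevancy` asks the LLM to generate a few questions that the response would answer, embeds them, and averages their cosine similarity to the original `user_input`. Evasive (noncommittal) responses are scored as zero. The sketch below mimics that aggregation on toy vectors. The names `cosine` and `response_relevancy` and all vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def response_relevancy(question_vec: list[float],
                       generated_vecs: list[list[float]],
                       noncommittal: bool = False) -> float:
    """Mean similarity between the user question and the generated questions."""
    if noncommittal or not generated_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)

# Toy embeddings (hypothetical): one generated question matches the user
# question exactly, the other is unrelated.
toy_relevancy = response_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
toy_evasive = response_relevancy([1.0, 0.0], [[1.0, 0.0]], noncommittal=True)
print(toy_relevancy, toy_evasive)
```

Note that the metric does not check correctness: it only measures whether the response addresses the question that was asked.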
