# Production RAG Pipeline Evaluation (Proof of Concept)

This notebook evaluates the production Retrieval-Augmented Generation (RAG) pipeline for the `AI Document Workbench` using the [**RAGAS**](https://docs.ragas.io/en/stable/) (Retrieval Augmented Generation Assessment) framework. 

Instead of testing a local sandbox, this Proof of Concept connects directly to the **MongoDB Atlas Vector Search** production database to evaluate real-world retrieval performance.

**Key Objectives:**
1. Fetch live document chunks from the production database.
2. Generate synthetic test questions using `gpt-4o-mini`.
3. Execute the questions through the production RAG chain.
4. Grade the pipeline automatically using `gpt-4o` as an impartial judge.

In [1]:
#!pip install ragas pandas ipykernel datasets openai pymongo
#!pip install langchain-openai langchain-mongodb langchain-core
#!pip install rapidfuzz

In [2]:
import os
from pymongo import MongoClient
import pandas as pd
from openai import OpenAI
from datasets import Dataset

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# RAGAS Imports (for evaluation)
from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision
from ragas.llms import llm_factory
from ragas.embeddings import embedding_factory

# LangChain Imports
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_core.runnables import RunnablePassthrough


## 1. Fetching Production Data
We query the live **MongoDB Atlas** cluster and pull a random sample of document chunks that have already been embedded and indexed.

In [3]:
# Connect to MongoDB
client = MongoClient(os.environ["MONGO_URI"])
DB_NAME = "ai_workbench"
COLLECTION_NAME = "documents" 
collection = client[DB_NAME][COLLECTION_NAME]

# Fetch random chunks for testing
print("Fetching 5 random chunks from existing database...")

pipeline = [{ "$sample": { "size": 5 } }]
random_docs = list(collection.aggregate(pipeline))
test_contexts = []
for doc in random_docs:
    text = doc.get("text") or doc.get("page_content") or ""
    if text:
        test_contexts.append(text)

print(f"Fetched {len(test_contexts)} chunks for testing.")


Fetching 5 random chunks from existing database...
Fetched 5 chunks for testing.
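
The cell above discards empty documents only after sampling, so the test set can come out smaller than the requested size. A minimal sketch of filtering before sampling follows, assuming the chunk text lives in `text` or `page_content` (the two fields the cell checks). `build_sample_pipeline` and `extract_text` are hypothetical helper names, not part of `app.py`.

```python
from typing import Any

TEXT_FIELDS = ("text", "page_content")  # the two fields checked in the cell above


def build_sample_pipeline(size: int, fields: tuple[str, ...] = TEXT_FIELDS) -> list[dict[str, Any]]:
    """Build an aggregation pipeline that samples only documents with a non-empty string field."""
    has_text = {"$or": [{f: {"$type": "string", "$ne": ""}} for f in fields]}
    return [{"$match": has_text}, {"$sample": {"size": size}}]


def extract_text(doc: dict[str, Any], fields: tuple[str, ...] = TEXT_FIELDS) -> str:
    """Return the first non-blank text field, or an empty string.

    Whitespace-only values pass the $match stage above, so they are rejected here.
    """
    for f in fields:
        value = doc.get(f)
        if isinstance(value, str) and value.strip():
            return value
    return ""
```

With these helpers, the sampling line could become `random_docs = list(collection.aggregate(build_sample_pipeline(5)))`, and the loop body could call `extract_text(doc)`.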


## 2. Synthetic Test Generation (Manual Approach)
We use `gpt-4o-mini` to act as an examiner. For each sampled document chunk, the model generates a specific, context-bound question and its corresponding "Ground Truth" answer.

*Note: This simulates our application with the same prompt, pipeline, and dataset.*

In [4]:
# Setup Generator
generator_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Exam Creator Prompt
generation_prompt = ChatPromptTemplate.from_template("""
You are a professor creating an exam. 
Given the following text context, write ONE question that can be answered using ONLY this information.
Also provide the correct answer (ground truth).

Format your output exactly like this:
QUESTION: [The question]
ANSWER: [The correct answer]

CONTEXT:
{context}
""")

chain = generation_prompt | generator_llm | StrOutputParser()

# Generate
data = []
print("Generating questions from our DB data...")

for i, context in enumerate(test_contexts):
    try:
        output = chain.invoke({"context": context})
        
        # Simple parsing
        parts = output.split("ANSWER:")
        question = parts[0].replace("QUESTION:", "").strip()
        ground_truth = parts[1].strip() if len(parts) > 1 else "Error parsing"
        
        data.append({
            "question": question,
            "ground_truth": ground_truth,
            "context": context # Keep context for reference
        })
        print(f" + Completed Q{i+1}")
        
    except Exception as e:
        print(f" - Failed Q{i+1}: {e}")

# Save to DataFrame
test_df = pd.DataFrame(data)
display(test_df.head())

Generating questions from our DB data...
 + Completed Q1
 + Completed Q2
 + Completed Q3
 + Completed Q4
 + Completed Q5


|  | question | ground_truth | context |
|---:|:---|:---|:---|
| 0 | What significant legal decision does Trump cla... | Roe v. Wade | ^ "Tracking how many key positions Trump has f... |
| 1 | What is one advantage of linear regression men... | It has no local optimum. | as you know it's going down over time and then... |
| 2 | What algorithm will be used to find a value of... | Gradient descent | we talk about when we talk about a generalizat... |
| 3 | Who are included in the context provided? | All presidential candidates, Presidents, Third... | All presidential candidates\n Presidents\n Thi... |
| 4 | What act is associated with the phrase "No tax... | One Big Beautiful Bill Act | White House Faith Office\nEconomic\nArtificial... |
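
The parsing in the cell above splits on `ANSWER:` and stores the literal string "Error parsing" as the ground truth when the marker is missing, which would then be graded as if it were a real reference answer. A minimal sketch of a stricter parser follows, assuming the model keeps the `QUESTION:` / `ANSWER:` format requested in the prompt. `parse_qa` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches the "QUESTION: ... ANSWER: ..." layout requested in the generation prompt.
QA_PATTERN = re.compile(
    r"QUESTION:\s*(?P<question>.+?)\s*ANSWER:\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa(output: str) -> Optional[tuple[str, str]]:
    """Return (question, answer), or None when the output is malformed."""
    match = QA_PATTERN.search(output)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer:
        return None
    return question, answer
```

Inside the generation loop, a `None` result could be skipped (for example with `continue`) so that malformed rows never reach the evaluation dataset.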


## 3. RAG Pipeline Execution
We pass the synthetic questions into the exact LangChain retrieval and generation pipeline used in the live `app.py`. We record the generated `answer` and the `retrieved_contexts` for the final grading step.

In [5]:


# Setup Retriever (Same as app.py)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=embeddings,
    index_name="default" 
)
retriever = vector_store.as_retriever()

# Define Chain (Same as app.py)
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert research assistant. Your goal is to provide accurate, well-structured answers based STRICTLY on the provided context.

    Instructions:
    1. Use ONLY the context provided below. DO NOT use outside knowledge.
    2. Write in PLAIN TEXT only. Do NOT use Markdown formatting.
    3. STRICTLY FORBIDDEN: Do not use bold (**text**), italics (*text*), or headers (#).
    4. You may use simple hyphens (-) for lists, but no other styling.

    Context:
    {context}"""),
    ("human", "{input}"),
])

rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()} 
    | rag_prompt
    | generator_llm
    | StrOutputParser()
)

# Run Evaluation Loop
print("Answering questions using Production RAG...")

answers = []
retrieved_contexts = []

for q in test_df["question"]:
    # Get Answer
    ans = rag_chain.invoke(q)
    answers.append(ans)
    
    # Record the chunks returned for this question (a second, separate retrieval call)
    docs = retriever.invoke(q)
    retrieved_contexts.append([d.page_content for d in docs])

# Update Data
test_df["answer"] = answers
test_df["retrieved_contexts"] = retrieved_contexts


print("Answers generated. Sample results:")
display(test_df.head())

Answering questions using Production RAG...
Answers generated. Sample results:


|  | question | ground_truth | context | answer | retrieved_contexts |
|---:|:---|:---|:---|:---|:---|
| 0 | What significant legal decision does Trump cla... | Roe v. Wade | ^ "Tracking how many key positions Trump has f... | Trump claims to have influenced the overturnin... | [Multiple analyses conducted by academic schol... |
| 1 | What is one advantage of linear regression men... | It has no local optimum. | as you know it's going down over time and then... | One advantage of linear regression mentioned i... | [as you know it's going down over time and the... |
| 2 | What algorithm will be used to find a value of... | Gradient descent | we talk about when we talk about a generalizat... | The algorithm that will be used to find a valu... | [the normal equation looks like so arms of thi... |
| 3 | Who are included in the context provided? | All presidential candidates, Presidents, Third... | All presidential candidates\n Presidents\n Thi... | The context provided includes the following in... | [Category\n List, All presidential candidates\... |
| 4 | What act is associated with the phrase "No tax... | One Big Beautiful Bill Act | White House Faith Office\nEconomic\nArtificial... | The phrase "No tax on tips" is associated with... | [One Big Beautiful Bill Act, In July 2025, Tru... |
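
The loop in this section calls the retriever twice per question: once inside `rag_chain` and once more to record the contexts. One way to guarantee that the graded contexts are exactly the ones the generator saw is to retrieve once and reuse the result. The sketch below shows the idea with plain callables. `answer_with_contexts`, `retrieve`, and `generate` are hypothetical names, and how the production chain would be split into these two steps is an assumption.

```python
from typing import Callable, Sequence


def answer_with_contexts(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, Sequence[str]], str],
) -> dict:
    """Retrieve once, then reuse the same chunks for generation and for logging."""
    contexts = list(retrieve(question))
    answer = generate(question, contexts)
    return {"question": question, "answer": answer, "retrieved_contexts": contexts}


# Toy stand-ins for the retriever and the prompt + LLM step.
def fake_retrieve(question: str) -> list[str]:
    return ["chunk A", "chunk B"]


def fake_generate(question: str, contexts: Sequence[str]) -> str:
    return f"Answer to {question!r} from {len(contexts)} chunks"


row = answer_with_contexts("What is X?", fake_retrieve, fake_generate)
print(row["retrieved_contexts"])  # ['chunk A', 'chunk B']
```

In the notebook, `retrieve` could wrap `retriever.invoke` and return each document's `page_content`, while `generate` would run the prompt and model on the chunks it is handed.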


## 4. RAGAS Evaluation (LLM-as-a-Judge)


We use `gpt-4o` to grade the RAG pipeline across two critical metrics:

* **Faithfulness (0.0 to 1.0):** Measures how well the generated answer is grounded in the retrieved context, so a higher score means fewer hallucinations. A score of 1.0 means every claim in the generated answer is directly supported by the retrieved context.
* **Context Precision (0.0 to 1.0):** Measures retrieval quality. A high score indicates that MongoDB Atlas successfully ranked the most relevant document chunks at the very top of the search results.
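
To make the two scores concrete, here is a simplified sketch of the arithmetic behind them as described in the RAGAS documentation: faithfulness is the fraction of answer claims supported by the context, and context precision is the mean of precision@k over the ranks that hold a relevant chunk. In RAGAS the claim and relevance verdicts come from the judge LLM. Here they are passed in by hand, and both function names are hypothetical.

```python
def faithfulness_score(supported_claims: int, total_claims: int) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if total_claims == 0:
        return 0.0  # arbitrary choice for this sketch; not taken from RAGAS
    return supported_claims / total_claims


def context_precision(relevance: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0


print(context_precision([True, False, False, False]))  # 1.0
print(context_precision([False, True, False, False]))  # 0.5
print(round(faithfulness_score(11, 12), 4))            # 0.9167
```

The second call shows why ranking matters: the same single relevant chunk scores 1.0 at position 1 but only 0.5 at position 2.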

In [6]:
print("Grading with RAGAS...")

# Prepare Dataset
eval_dataset = Dataset.from_pandas(test_df)

# Init OpenAI client
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Setup models
wrapped_critic = llm_factory('gpt-4o', client=openai_client)
wrapped_embeddings = embedding_factory('openai', model='text-embedding-3-small', client=openai_client)

result = evaluate(
    eval_dataset,
    metrics=[
        Faithfulness(),
        ContextPrecision()
    ],
    llm=wrapped_critic,
    embeddings=wrapped_embeddings
)


result.to_pandas().to_csv("rag_evaluation_final.csv", index=False)
print("Final Scores:")
print(result)



Grading with RAGAS...



Final Scores:
{'faithfulness': 0.8833, 'context_precision': 0.7111}


### Final RAGAS Evaluation Results

| user_input | retrieved_contexts | response | reference | faithfulness | context_precision |
|:---|:---|:---|:---|---:|---:|
| What significant legal decision does Trump claim to have influenced? | ['Multiple analyses conducted by academic scholars... | Trump claims to have influenced the overturning of... | Roe v. Wade... | 1.0000 | 0.5833 |
| What is one advantage of linear regression mentioned in the text? | ["as you know it's going down over time and then i... | One advantage of linear regression mentioned in th... | It has no local optimum.... | 1.0000 | 0.8333 |
| What algorithm will be used to find a value of theta that minimizes the cost function J of theta? | ["the normal equation looks like so arms of this d... | The algorithm that will be used to find a value of... | Gradient descent... | 1.0000 | 0.6389 |
| Who are included in the context provided? | ['Category\n List', 'All presidential candidates\n... | The context provided includes the following indivi... | All presidential candidates, Presidents, Third-par... | 0.9167 | 0.5000 |
| What act is associated with the phrase "No tax on tips"? | ['One Big Beautiful Bill Act', "In July 2025, Trum... | The phrase "No tax on tips" is associated with the... | One Big Beautiful Bill Act... | 0.5000 | 1.0000 |
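
As a quick sanity check, the aggregate scores printed above can be reproduced from the per-question rows, and low-scoring questions can be flagged for manual review. This is a minimal sketch using the values copied from the table. The 0.6 threshold is a hypothetical cutoff chosen only for illustration.

```python
from statistics import mean

# Per-question scores copied from the results table above.
scores = [
    {"q": 1, "faithfulness": 1.0000, "context_precision": 0.5833},
    {"q": 2, "faithfulness": 1.0000, "context_precision": 0.8333},
    {"q": 3, "faithfulness": 1.0000, "context_precision": 0.6389},
    {"q": 4, "faithfulness": 0.9167, "context_precision": 0.5000},
    {"q": 5, "faithfulness": 0.5000, "context_precision": 1.0000},
]
METRICS = ("faithfulness", "context_precision")
THRESHOLD = 0.6  # hypothetical cutoff for flagging a row

# Mean score per metric, rounded the same way as the printed result.
summary = {m: round(mean(r[m] for r in scores), 4) for m in METRICS}
# Question numbers that fall below the cutoff for each metric.
flagged = {m: [r["q"] for r in scores if r[m] < THRESHOLD] for m in METRICS}

print(summary)  # {'faithfulness': 0.8833, 'context_precision': 0.7111}
print(flagged)  # {'faithfulness': [5], 'context_precision': [1, 4]}
```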



### Analysis & Key Takeaways

**1. Generator Performance (Faithfulness): Excellent**
* The system scored 1.0 on 3 out of 5 questions and averaged 0.88 overall.
* **Meaning:** On this small sample, our LLM (`gpt-4o-mini`) is reliable. It mostly restricts itself to the provided context and avoids hallucinating outside knowledge.
* **The Outlier:** Question 5 scored 0.5. The LLM likely included extra conversational filler, or the source material did not provide much data to begin with.

**2. Retriever Performance (Context Precision): Needs Tuning**
* Scores fluctuated heavily (0.50 to 1.00), averaging 0.71.
* **Meaning:** MongoDB Atlas Vector Search is finding the right information, but it is not always ranking the *best* chunk at the absolute top (Position #1). A score of 0.50 is what results when, for example, the only relevant chunk is ranked 2nd instead of 1st.

