### Testing RAG Applications 📑

#### RAG Application
This application loads an article about Model Context Protocol (MCP) servers from the web, splits it into chunks, embeds the chunks, and stores them in a vector store. At inference time, the chunks most relevant to a question are retrieved and passed to the LLM so it can answer questions about MCP.
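The retrieval step described above can be sketched without any external services. This is a minimal illustration, not the notebook's implementation: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for the vector store, and the three sample chunks are made-up toy data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumed stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Join the retrieved chunks into a context block and append the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question based only on the following context:\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Toy chunks, invented for illustration only.
chunks = [
    "MCP is a protocol that connects AI models to external tools.",
    "Chroma is a vector database for embeddings.",
    "Bananas are rich in potassium.",
]
top = retrieve("What is MCP", chunks, k=1)
prompt_text = build_prompt("What is MCP", top)
```

In the notebook itself, the ranking is delegated to the vector store's retriever and the prompt is filled in by the chain; the sketch only shows the shape of the data flow.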

<img src="./img/RAG.png" width="500" height="400" style="display: block; margin: auto;">

In [1]:
#!pip install -qU langchain-chroma

#!pip install -U DeepEval

In [2]:
import os
import deepeval

# Read the Confident AI API key from the environment instead of hard-coding it in the notebook.
deepeval.login_with_confident_api_key(os.environ["CONFIDENT_API_KEY"])

In [3]:
!deepeval set-ollama deepseek-r1:8b

🙌 Congratulations! You're now using a local Ollama model for all evals that require an LLM.


In [4]:
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import List
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.document import Document
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [5]:
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import (
    observe,
    update_current_span,
    RetrieverAttributes
)

In [6]:
# The model label on the span should match the model that is actually returned.
@observe(type='llm', model='gpt-4.1-2025-04-14')
def local_llms():
    # Local alternative via Ollama (disabled):
    # return ChatOllama(
    #     base_url="http://localhost:11434",
    #     model = "qwen2.5:latest",
    #     temperature=0.5,
    #     max_tokens = 250
    # )
    return ChatOpenAI(model="gpt-4.1-2025-04-14", max_completion_tokens=300)

llm = local_llms()

In [7]:
# Load data from Web
loader = WebBaseLoader("https://www.descope.com/learn/post/mcp")
data = loader.load()

# Split text into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# Add text to vector db
# embedding = OllamaEmbeddings(model="llama3.2:latest")
embedding = OpenAIEmbeddings(model="text-embedding-3-large")
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

# Create a retriever
retriever = vectordb.as_retriever()

def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([d.page_content for d in docs])


template = """Answer the question based only on the following context:

    {context}
    
    Give a summary not the full detail

    Question: {question}
    """
prompt = ChatPromptTemplate.from_template(template)


@observe(metrics=[AnswerRelevancyMetric()])
def retrieve_and_format(question):
    docs = retriever.invoke(question)
    response = format_docs(docs)
    
    update_current_span(
        test_case=LLMTestCase(input=question, actual_output=response)
    )
    
    return response

chain = {"context": retrieve_and_format, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
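
The `chunk_size` and `chunk_overlap` arguments passed to the splitter above control how large each chunk is and how many characters neighbouring chunks share. The sketch below shows the idea with plain fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into character windows of chunk_size.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"
no_overlap = chunk_text(text, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=4, chunk_overlap=2)
```

With no overlap, a sentence that straddles a chunk boundary is cut in two, and neither half may be retrieved on its own. A small overlap is a common way to reduce that risk at the cost of storing more chunks.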


In [8]:
@observe(type="retriever", embedder="text-embedding-3-large")
def retrive_documents(question):
    retrieved_context = retrieve_and_format(question)

    # Record the query and the retrieved text on the retriever span.
    update_current_span(
        attributes=RetrieverAttributes(
            embedding_input=question,
            retrieval_context=[retrieved_context]
        )
    )

    return retrieved_context



#### Output of the LLM Application

In [9]:

@observe(type="custom", name="RAG Application", metrics=[ContextualRelevancyMetric()])
def rag_application(question):
    # Generate the answer with the chain, which retrieves context internally.
    actual_response = chain.invoke(question)
    # Retrieve again so the context can be attached to the test case.
    retrieved_context = retrive_documents(question)

    update_current_span(
        test_case=LLMTestCase(
            input=question,
            actual_output=actual_response,
            retrieval_context=[retrieved_context],
        )
    )

    return actual_response
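
The relevancy metrics attached to the spans above rely on an LLM judge. A rough sketch of the scoring idea, assuming the score is the fraction of statements judged relevant to the input, is shown below. The keyword-overlap `is_relevant` judge, the stop-word list, the sample statements, and the 0.5 threshold are all hypothetical and chosen only for illustration; they are not DeepEval's implementation.

```python
import re

# Hypothetical stop words ignored by the toy judge.
STOP_WORDS = {"what", "is", "the", "a", "an", "of", "to"}


def content_words(text: str) -> set[str]:
    """Lowercase words in text, minus the stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def is_relevant(statement: str, question: str) -> bool:
    """Toy judge: relevant if the statement shares a content word with the question."""
    return bool(content_words(statement) & content_words(question))


def relevancy_score(statements: list[str], question: str) -> float:
    """Fraction of statements judged relevant to the question."""
    if not statements:
        return 0.0
    relevant = sum(is_relevant(s, question) for s in statements)
    return relevant / len(statements)


# Sample statements, as if extracted from a retrieved context.
statements = [
    "MCP standardizes how models call tools.",
    "An official MCP registry is being planned.",
    "The weather is nice today.",
    "Cookies help us improve the site.",
]
score = relevancy_score(statements, "What is MCP")
passed = score >= 0.5  # illustrative threshold
```

Scraped web pages often carry navigation text and cookie notices, so a low contextual relevancy score can point at the loader or the chunking rather than at the retriever.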



#### Evaluation of RAG Application

In [10]:
from deepeval.dataset import Golden
from deepeval import evaluate

# A golden holds the input for one test case.
golden = Golden(input="What is MCP")
evaluate(goldens=[golden], observed_callback=rag_application)

Evaluating goldens: |          |  0% (0/1) [Time Taken: 00:00, ?it/s]

Ending trace: [BaseSpan(uuid='adda6f3e-26b9-4ff5-b9c4-31ca57886ee0', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[BaseSpan(uuid='ae59fd9a-85de-49a8-8eb0-96b1ed5a1806', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[BaseSpan(uuid='f2919f80-7f23-45f1-a8e9-e5a2ab556597', status=<TraceSpanStatus.SUCCESS: 'SUCCESS'>, children=[], trace_uuid='525eb692-c1d8-44cf-828d-a3c1ca310b28', parent_uuid='ae59fd9a-85de-49a8-8eb0-96b1ed5a1806', start_time=304779.015910291, end_time=304779.883650208, name='retrieve_and_format', metadata=None, input={'question': 'What is MCP'}, output="development overhead and enables a more interoperable ecosystem where innovation benefits the entire community—rather than remaining siloed.As MCP continues to progress as a standard, several new developments have appeared on the horizon:Official MCP registry: A maintainer-sanctioned registry for MCP servers is being planned, which will simplify discovery and integration of available tools. This centralized


Evaluating goldens: |██████████|100% (1/1) [Time Taken: 00:46, 46.88s/it]
     ⚡ Invoking traceable callback: |██████████|100% (1/1) [Time Taken: 00:46, 46.88s/it]




Metrics Summary

For test case:

  - input: What is MCP
  - actual output: None
  - expected output: None
  - context: None
  - retrieval context: None

Overall Metric Pass Rates





EvaluationResult(test_results=[TestResult(name='test_case_1', success=True, metrics_data=[], conversational=False, multimodal=False, input='What is MCP', actual_output=None, expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link='https://app.confident-ai.com/project/cmb8sq46q07rf1tfo1k6r68x4/evaluation/test-runs/cmbbhq7gc00gcepdzs0lnpcim/compare-test-results')
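
The printed result reports `success=True` while `metrics_data` is empty, so the pass does not reflect any metric score. A small guard like the one below can flag such cases before a run is trusted. It is a sketch only: `ResultStub` is a hypothetical stand-in that mirrors three fields visible in the printed `TestResult`, not the DeepEval class itself, and the second entry is invented for contrast.

```python
from dataclasses import dataclass, field


@dataclass
class ResultStub:
    """Hypothetical stand-in with three fields seen in the printed TestResult."""
    name: str
    success: bool
    metrics_data: list = field(default_factory=list)


def vacuous_passes(results: list[ResultStub]) -> list[str]:
    """Names of results that passed without any metric having been recorded."""
    return [r.name for r in results if r.success and not r.metrics_data]


results = [
    # Mirrors the printed output: success with no metric data.
    ResultStub(name="test_case_1", success=True, metrics_data=[]),
    # Invented entry with a placeholder metric record, for contrast.
    ResultStub(name="test_case_2", success=True, metrics_data=["placeholder"]),
]
flagged = vacuous_passes(results)
```

Why the metric data is empty in this run is not established here; the test run on the linked Confident AI page is the place to check whether span-level metrics were recorded.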