### 📖 Where We Are

**In the previous sections**, we have mastered building a variety of sophisticated RAG pipelines, from simple retrievers to complex, self-correcting agents. We've learned how to load data, embed it, and use different retrieval strategies to provide context to an LLM.

**But how do we know if our RAG system is actually *good*?** How can we measure its performance, compare different versions, and identify areas for improvement? 

**In this new section on RAG Evaluation**, we will answer those questions. This notebook introduces the practice of evaluating RAG pipelines. We will use **LangSmith** to create a test dataset and apply the **LLM-as-a-Judge** pattern to score our system automatically on two metrics, correctness and groundedness. Relevance and retrieval relevance are further metrics that fit the same pattern.

### 1. The System Under Test: A Basic RAG Pipeline
First, let's build the RAG application that we want to evaluate. This will be a standard pipeline that retrieves documents from a vector store and uses them to generate an answer.

In [None]:
# --- Environment and Library Setup ---
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq

# Load API keys from a local .env file into the process environment.
load_dotenv()

# Fail early with a clear message if the Groq key is missing.
# (Assigning os.getenv(...) straight into os.environ raises TypeError when the key is unset.)
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file.")

In [1]:
## RAG
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# List of URLs to load documents from
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

# Load documents from the URLs
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

# Initialize a text splitter with specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)

# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)

# Add the document chunks to the in-memory vector store using HuggingFaceEmbeddings
vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits,
    embedding=HuggingFaceEmbeddings(model="all-MiniLM-L6-v2"),
)

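# Turn the vector store into a retriever that returns the top 6 chunks per query.
# The number of results is set through `search_kwargs`, not a bare `k` argument.
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})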



In [8]:
# --- The RAG Bot Function ---
from langsmith import traceable

# The `@traceable` decorator automatically logs the inputs, outputs, and any errors of this function to LangSmith.
@traceable()
def rag_bot(question: str) -> dict:
    """Our simple RAG pipeline function that we want to evaluate."""
    # 1. Retrieve relevant documents.
    docs = retriever.invoke(question)
    docs_string = " ".join(doc.page_content for doc in docs)

    # 2. Create a system prompt with the retrieved context.
    instructions = f"""You are a helpful assistant. Use the following source documents to answer the user's questions. 
    If you don't know the answer, just say that you don't know. Keep the answer concise.
    Documents: {docs_string}"""
    
    # 3. Generate the final answer.
    llm = ChatGroq(model="openai/gpt-oss-20b")
    ai_msg = llm.invoke([
         {"role": "system", "content": instructions},
         {"role": "user", "content": question},
    ])
    
    # Return both the answer and the retrieved documents; the groundedness evaluator needs the documents.
    return {"answer": ai_msg.content, "documents": docs}

### 2. Creating an Evaluation Dataset in LangSmith

To evaluate our system, we need a benchmark. In LangSmith, this is a **dataset** of questions and their corresponding ground-truth answers. This allows us to test our RAG bot's performance on a consistent set of examples.

**Analogy: A Final Exam for Your RAG Bot 📝**

Think of this dataset as the final exam for your RAG application. Each question-answer pair is like a question on the test. By running your bot against this exam, you can get a clear score on how well it performs and identify which questions it struggles with.
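The evaluators later in this notebook read `inputs["question"]` and `outputs["answer"]` from every example, so it can help to check the shape of the exam before uploading it. The following is a minimal sketch in plain Python. `validate_examples` and `sample_examples` are hypothetical names introduced for illustration and are not part of the LangSmith SDK.

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks usable."""
    problems = []
    seen_questions = set()
    for i, ex in enumerate(examples):
        question = ex.get("inputs", {}).get("question")
        answer = ex.get("outputs", {}).get("answer")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing or empty inputs['question']")
            continue
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing or empty outputs['answer']")
        if question in seen_questions:
            problems.append(f"example {i}: duplicate question")
        seen_questions.add(question)
    return problems


# Hypothetical sample data, for illustration only.
sample_examples = [
    {"inputs": {"question": "What is 2 + 2?"}, "outputs": {"answer": "4"}},
    {"inputs": {"question": "What is 2 + 2?"}, "outputs": {"answer": "four"}},
    {"inputs": {"question": "Capital of France?"}, "outputs": {}},
]

for problem in validate_examples(sample_examples):
    print(problem)
```

Running a check like this on the real `examples` list before calling the LangSmith client keeps malformed questions out of the benchmark.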

In [9]:
from langsmith import Client

# Initialize the LangSmith client.
client = Client()

# Define the questions and ground-truth answers for our test dataset.
examples = [
    {
        "inputs": {"question": "How does the ReAct agent use self-reflection?"},
        "outputs": {"answer": "ReAct integrates reasoning and acting by performing actions with tools and then observing the outputs to inform its next step."},
    },
    {
        "inputs": {"question": "What are the types of biases that can arise with few-shot prompting?"},
        "outputs": {"answer": "The biases that can arise with few-shot prompting include majority label bias, recency bias, and common token bias."},
    },
    {
        "inputs": {"question": "What are five types of adversarial attacks on LLMs?"},
        "outputs": {"answer": "Five types of adversarial attacks are token manipulation, gradient-based attacks, jailbreak prompting, human red-teaming, and model red-teaming."},
    }
]

# Create the dataset in LangSmith, or reuse it if a previous run already created it.
# (Creating a dataset whose name already exists raises LangSmithConflictError.)
dataset_name = "RAG_Evaluation_Test_v1"
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name, description="Test dataset for RAG pipeline evaluation.")
    # Upload the examples to the new dataset.
    client.create_examples(dataset_id=dataset.id, examples=examples)

### 3. Implementing Evaluators (LLM-as-a-Judge)

Manually grading hundreds of RAG outputs is impractical. Instead, we use the **LLM-as-a-Judge** pattern: a second LLM call (the "judge") receives a specific grading prompt (the "rubric") and a structured output schema, and returns a score for each metric. In this notebook the judge is the same Groq-hosted model as the RAG bot, run at temperature 0.

#### Evaluator 1: Correctness
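This checks whether the RAG system's answer is factually consistent with the ground-truth answer in our dataset. The grader returns a boolean: the answer may add information, but it must not contradict the ground truth.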

In [None]:
from typing_extensions import Annotated, TypedDict
from langchain_groq import ChatGroq
# Define the structured output schema for our grader.
class CorrectnessGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]

# Define the detailed instructions (the "rubric") for the grader LLM.
correctness_instructions = """You are a teacher grading a quiz. You will be given a QUESTION, the GROUND TRUTH answer, and a STUDENT ANSWER. 
Grade the student's answer based ONLY on its factual accuracy compared to the ground truth. It is OK if the student's answer contains more information, as long as it does not contradict the ground truth."""

# Create the grader LLM, forcing it to return the structured output.
grader_llm = ChatGroq(model="openai/gpt-oss-20b", temperature=0).with_structured_output(CorrectnessGrade)

# Define the evaluator function that LangSmith will call.
def correctness(run, example) -> dict:
    """An evaluator for RAG answer accuracy."""
    inputs = example.inputs
    outputs = run.outputs
    reference_outputs = example.outputs
    
    # Combine the information into a single string for the grader.
    answers = f"""QUESTION: {inputs['question']}
    GROUND TRUTH ANSWER: {reference_outputs['answer']}
    STUDENT ANSWER: {outputs['answer']}"""
    
    # Run the grader.
    grade = grader_llm.invoke([{"role": "system", "content": correctness_instructions}, {"role": "user", "content": answers}])
    return {"key": "correctness", "score": grade["correct"]}

#### Evaluator 2: Groundedness
This is one of the most important RAG metrics. It checks whether every claim in the RAG system's answer is supported by the documents it retrieved. An unsupported claim is a **hallucination**, so this metric is how we detect them.

In [10]:
class GroundedGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    grounded: Annotated[bool, ..., "True if every claim in the answer is supported by the facts, False otherwise."]

grounded_instructions = """You are a grader checking for hallucinations. You will be given a set of FACTS (retrieved documents) and a STUDENT ANSWER.
Your task is to determine if the student's answer is fully supported by the provided facts. The answer is grounded if all claims made in it can be verified from the facts."""

grounded_llm = ChatGroq(model="openai/gpt-oss-20b", temperature=0).with_structured_output(GroundedGrade)

def groundedness(run, example) -> dict:
    outputs = run.outputs
    # The 'documents' must be returned by our RAG bot function to be used here.
    doc_string = "\n\n".join(doc.page_content for doc in outputs["documents"])
    answer = f"FACTS: {doc_string}\nSTUDENT ANSWER: {outputs['answer']}"
    grade = grounded_llm.invoke([{"role": "system", "content": grounded_instructions}, {"role": "user", "content": answer}])
    return {"key": "groundedness", "score": grade["grounded"]}
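The same pattern extends to answer relevance, a metric that neither evaluator above covers. Below is a hedged sketch that is not part of the original notebook: `RelevanceGrade`, `relevance_instructions`, `make_relevance_evaluator`, and `KeywordOverlapGrader` are names invented here. The grader is passed in as any object with an `invoke(messages)` method that returns a dict, so in this notebook it could plausibly be `ChatGroq(...).with_structured_output(RelevanceGrade)`. The keyword-overlap stand-in exists only so the sketch runs offline without an API key; it is not a serious relevance measure.

```python
from types import SimpleNamespace
from typing import Annotated, TypedDict


class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "True if the answer addresses the question, False otherwise."]


relevance_instructions = """You are a teacher grading a quiz. You will be given a QUESTION and a STUDENT ANSWER.
Decide whether the student's answer addresses the question. Judge relevance only, not factual accuracy."""


def make_relevance_evaluator(grader):
    """Build an evaluator around any grader exposing `invoke(messages) -> dict`."""
    def relevance(run, example) -> dict:
        prompt = f"QUESTION: {example.inputs['question']}\nSTUDENT ANSWER: {run.outputs['answer']}"
        grade = grader.invoke([
            {"role": "system", "content": relevance_instructions},
            {"role": "user", "content": prompt},
        ])
        return {"key": "relevance", "score": grade["relevant"]}
    return relevance


class KeywordOverlapGrader:
    """Offline stand-in for an LLM judge (assumption: demonstration only)."""
    def invoke(self, messages: list[dict]) -> RelevanceGrade:
        user_text = messages[-1]["content"]
        question, answer = user_text.split("\nSTUDENT ANSWER: ", 1)
        question = question.removeprefix("QUESTION: ")
        # Compare longer question words against the words in the answer.
        q_words = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
        a_words = {w.strip("?.,").lower() for w in answer.split()}
        shared = q_words & a_words
        return {"explanation": f"shared keywords: {sorted(shared)}", "relevant": bool(shared)}


relevance = make_relevance_evaluator(KeywordOverlapGrader())
example = SimpleNamespace(inputs={"question": "What biases arise with few-shot prompting?"})
on_topic = SimpleNamespace(outputs={"answer": "Few-shot prompting can show recency bias."})
off_topic = SimpleNamespace(outputs={"answer": "Paris is in France."})
print(relevance(on_topic, example))
print(relevance(off_topic, example))
```

With a structured-output LLM swapped in for the stand-in, `relevance` has the same `(run, example)` shape as `correctness` and `groundedness`. A retrieval-relevance grader could follow the same design by sending the retrieved documents to the judge instead of the answer.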

### 4. Running the Evaluation
Now we use the `evaluate` function to orchestrate the entire test. It will run our `rag_bot` on every example in our dataset and then apply each of our evaluator functions to the results.

In [11]:
from langsmith.evaluation import evaluate

# Define the function to be evaluated.
def target(inputs: dict) -> dict:
    return rag_bot(inputs["question"])

# Run the evaluation.
experiment_results = evaluate(
    target, # The RAG system to test.
    data=dataset_name, # The name of the LangSmith dataset.
    evaluators=[correctness, groundedness], # The list of our custom grading functions.
    experiment_prefix="rag-evaluation-run", # Prefix for the experiment name; LangSmith appends a unique suffix.
    metadata={"version": "Initial RAG pipeline"},
)

View the evaluation results for experiment: 'rag-evaluation-run-f18d3202' at:
https://smith.langchain.com/o/5cf2e409-b5ba-4812-a6b3-85a090392b44/datasets/df74f216-0950-4396-b730-6f2ab1da6aee/compare?selectedSessions=beb68290-1d15-4131-b9bb-0376dd97ae31




3it [00:16,  5.34s/it]
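
Conceptually, the call above amounts to a loop over examples with an inner loop over evaluators. The sketch below imitates that flow in plain Python to make the moving parts visible. It is not LangSmith's implementation and leaves out tracing, concurrency, and result upload. `run_experiment`, `echo_target`, and `exact_match` are hypothetical names, and the toy data is invented for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def run_experiment(target, examples, evaluators):
    """Run `target` on each example, score it with each evaluator, and average the scores per metric."""
    scores = defaultdict(list)
    for example in examples:
        run = SimpleNamespace(outputs=target(example.inputs))
        for evaluator in evaluators:
            result = evaluator(run, example)
            scores[result["key"]].append(float(result["score"]))
    return {key: sum(values) / len(values) for key, values in scores.items()}


# Toy stand-ins (assumptions for illustration, not the notebook's RAG bot or graders).
def echo_target(inputs: dict) -> dict:
    known = {"What is 2 + 2?": "4"}
    return {"answer": known.get(inputs["question"], "I don't know.")}


def exact_match(run, example) -> dict:
    return {"key": "exact_match", "score": run.outputs["answer"] == example.outputs["answer"]}


toy_examples = [
    SimpleNamespace(inputs={"question": "What is 2 + 2?"}, outputs={"answer": "4"}),
    SimpleNamespace(inputs={"question": "What is 3 + 3?"}, outputs={"answer": "6"}),
]

summary = run_experiment(echo_target, toy_examples, [exact_match])
print(summary)
```

Averaging boolean scores per metric gives a pass rate, which is a convenient single number for comparing two versions of a pipeline on the same dataset.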


### 🔑 Key Takeaways

* **Evaluation is Essential**: To build reliable RAG systems, you must move beyond anecdotal testing and adopt a structured evaluation process. This is the key to measuring performance and making targeted improvements.
* **LangSmith for Experimentation**: LangSmith is a powerful platform for LLM Ops. It allows you to create curated **datasets** of test cases and run repeatable **experiments** to benchmark your application's performance.
* **LLM-as-a-Judge**: This is a powerful and scalable pattern for automating evaluation. By using a strong LLM with a detailed prompt (a rubric) and a structured output schema, you can automatically grade your RAG system's outputs.
* **Key Evaluation Metrics**: For a RAG system, you need to measure multiple aspects. This notebook ran the first two in its experiment:
    - **Correctness**: Is the answer factually accurate compared to a ground truth?
    - **Groundedness**: Is the answer supported by the retrieved context? (Anti-hallucination metric)
    - **Relevance**: Does the answer actually address the user's question?
    - **Retrieval Relevance**: Did the retriever fetch useful documents in the first place?