## RAG Evaluation with LangSmith

For this evaluation we will be using 3 different types of RAG evaluation (here, `<>` means "compared against"):

1. **Response <> reference answer**: metrics like correctness measure "*how similar/correct is the answer, relative to a ground-truth*"
2. **Response <> input**: metrics like answer relevance, helpfulness, etc. measure "*how well does the generated response address the initial user input*"
3. **Response <> retrieved docs**: metrics like faithfulness, hallucinations, etc. measure "*to what extent does the generated response agree with the retrieved context*"


<div>
<img src="https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/langsmith_rag_eval.png" alt='langsmith_rag_eval' width="1000"/>
</div>
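
As a rough sketch of the three comparisons listed above, the snippet below records which pair of fields each one reads. The `EvalRecord` class, the `COMPARISONS` labels, and `fields_for` are hypothetical names introduced only for illustration; they are not part of LangSmith.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    """One evaluated example (hypothetical container, not a LangSmith type)."""
    question: str        # the user input
    answer: str          # the generated response
    reference: str       # the ground-truth answer
    contexts: List[str]  # the retrieved documents


# Which pair of fields each comparison reads (labels are illustrative).
COMPARISONS = {
    "response_vs_reference": ("answer", "reference"),
    "response_vs_input": ("answer", "question"),
    "response_vs_retrieved_docs": ("answer", "contexts"),
}


def fields_for(record: EvalRecord, comparison: str):
    """Return the two values a given comparison would hand to its grader."""
    left, right = COMPARISONS[comparison]
    return getattr(record, left), getattr(record, right)
```

Each evaluator defined later in this notebook reads some combination of these fields from the run outputs and the dataset example.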


### Step 1: Create the RAG pipeline 

We will be using LangChain strictly for creating the retriever and retrieving the relevant documents. The overall pipeline does not use LangChain. LangSmith works regardless of whether or not your pipeline is built with LangChain.


In [1]:
# %capture --no-stderr
# ! pip install langsmith langchain-community langchain chromadb tiktoken openai sentence-transformers python-dotenv

First, the imports and API keys needed to run the evaluation.

In [None]:
import os
import re
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langsmith import Client
import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from langsmith.evaluation import evaluate
from dotenv import load_dotenv, find_dotenv



In [2]:
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
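# LangSmith reads its configuration from environment variables, so set them in os.environ
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"] = "elden_ring_chatbot"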
os.environ["TOKENIZERS_PARALLELISM"] = "false"
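
Because a missing key usually only surfaces later as an authentication error, it can help to check the environment up front. The following is a minimal sketch; the helper name `missing_env_vars` and the choice of required variables are assumptions for this notebook, not a LangSmith API.

```python
import os

# Variables this notebook relies on (an assumed list, adjust to your setup).
REQUIRED_VARS = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY"]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.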

We build the index using the existing Vector Database created with Chroma.

In [None]:
# INDEX

# Initialize the embedding function
embedding_function = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Path to the vector store directory
persist_directory = "Elden_vector_store"
collection_name = "Elden_Ring_Lore"

# Load the existing vector store
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_function,
    collection_name=collection_name
)



Loaded existing vector store.


Create the retriever, using the same settings as the chatbot so that the evaluation reflects its behavior.

In [9]:
# Set up the retriever
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 5})
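
To make the `k=5` similarity search more concrete, here is a small sketch of top-k retrieval over toy vectors. It assumes cosine similarity purely for illustration; the metric the Chroma collection actually uses depends on how it was configured, and `cosine_similarity` and `top_k` are hypothetical helpers rather than library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy example: three 2-dimensional "embeddings" and one query.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```

A larger `k` gives the model more context to draw on, at the cost of a longer prompt and a higher chance of including weakly related passages.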

Next, we build a `RAG chain` that returns an `answer` and the retrieved documents as `contexts`.

In [10]:
### RAG

class RagBot:
    def __init__(self, retriever, model: str = "gpt-4-0125-preview"):
        self._retriever = retriever
        # Wrapping the client instruments the LLM
        self._client = wrap_openai(openai.Client())
        self._model = model

    @traceable()
    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def get_answer(self, question: str):
        similar = self.retrieve_docs(question)
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a lore master of Elden Ring, entrusted to narrate only the stories of the game's world. "
                        "Speak in an epic tone, as if revealing ancient knowledge, but remain precise and truthful, "
                        "providing guidance for the adventurer.\n\n"
                        # The question is sent as the user message, so only the retrieved docs are appended here
                        f"## Docs\n\n{similar}"
                    ),
                },
                {"role": "user", "content": question},
            ],
        )

        # Evaluators will expect "answer" and "contexts"
        return {
            "answer": response.choices[0].message.content,
            "contexts": [str(doc) for doc in similar],
        }


rag_bot = RagBot(retriever)

In [11]:
response = rag_bot.get_answer("Who are the demigods?")
response["answer"][:150]

'In the era shadowed by the aftershocks of the Shattering, the lands Between are roiled by the sagas of the demigods, offspring of Queen Marika the Ete'
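
The bot above interpolates the retrieved list directly into the prompt, so the model sees the Python representation of each document. One possible refinement, sketched below, is to join only the text of each document. LangChain documents expose their text as `page_content`; the `FakeDoc` class and the `format_docs` helper are hypothetical and exist only so the sketch runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document with a page_content attribute."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join document texts into one numbered context string."""
    return "\n\n".join(
        f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
    )


print(format_docs([FakeDoc("The Erdtree towers over the Lands Between."),
                   FakeDoc("Marika shattered the Elden Ring.")]))
```

Numbering the passages also makes it easier for a grader to point at the specific document that supports, or fails to support, a claim.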

### Step 2: Build up the RAG Dataset 

Next, we build a dataset of QA pairs based upon the documentation that we indexed.

In [12]:
# Sample 
# Pairs extracted for Elden Ring
inputs = [
    "What is the Greater Will and how did it shape the Lands Between?",
    "Who are the Two Fingers and what role do they play in the Golden Order?",
    "What is the significance of the Rune of Death in Elden Ring's lore?",
    "Who is Queen Marika and how did she change the fate of the Lands Between?",
    "What is the origin of the Erdtree and why is it important?",
    "Who is Radagon, and what connection does he have with Queen Marika?",
    "What are the Outer Gods, and how do they influence the world of Elden Ring?",
    "What role does the Tarnished play in the grand scheme of the Elden Ring story?"
]

outputs = [
    "The Greater Will is an Outer God that sent the Elden Beast to the Lands Between, establishing the Golden Order and shaping the world according to its divine plan. Its influence is seen in the governance and faith of the Lands Between.",
    "The Two Fingers serve as envoys of the Greater Will, interpreting its desires and guiding the inhabitants of the Lands Between to maintain the Golden Order. They communicate through Finger Readers and help chosen Tarnished understand their destiny.",
    "The Rune of Death, once part of the Elden Ring, was removed by Queen Marika to prevent the true death of beings. This act created the foundation for the Golden Order, allowing the Erdtree to reabsorb the souls of the deceased and prevent their permanent demise.",
    "Queen Marika is the Empyrean chosen by the Greater Will to embody the Elden Ring. Her decisions, including removing the Rune of Death and shattering the Elden Ring, triggered significant upheavals that set the stage for the game’s events.",
    "The Erdtree, believed to be a manifestation of the Greater Will's power, stands as a beacon of order and life in the Lands Between. Its roots connect to the fundamental flow of souls and the Golden Order, symbolizing divine grace.",
    "Radagon is a mysterious figure who is both the consort of Queen Rennala and later revealed to be part of Queen Marika herself. His dual identity is key to understanding the complex narrative surrounding the Elden Ring's shattering.",
    "The Outer Gods are powerful entities like the Greater Will, the Frenzied Flame, and the Formless Mother. They exert their influence over the Lands Between, often conflicting with one another and affecting the fate of its inhabitants.",
    "The Tarnished are exiled beings called back to the Lands Between to reclaim their grace and pursue the path to becoming the Elden Lord. Their role is to mend the Elden Ring and bring order or chaos, depending on their choices."
]

# Create the QA pairs
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Initialize the LangSmith client
client = Client()

dataset_name = "Elden_Ring_evaluation_3"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs focusing on Elden Ring gameplay, lore, characters, and strategies."
)

# Add examples to the dataset
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

print("QA dataset for Elden Ring Game created successfully.")


QA dataset for Elden Ring Game created successfully.
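
Since `zip` silently truncates to the shorter list, a mismatch between questions and answers would go unnoticed before upload. The sketch below shows one way to check the pairs first; `validate_qa_pairs` is a hypothetical helper, not part of the LangSmith client.

```python
def validate_qa_pairs(questions, answers):
    """Check that questions and answers line up, then return them as QA dicts."""
    if len(questions) != len(answers):
        raise ValueError(
            f"Got {len(questions)} questions but {len(answers)} answers."
        )
    for i, (q, a) in enumerate(zip(questions, answers)):
        if not q.strip() or not a.strip():
            raise ValueError(f"Empty question or answer at index {i}.")
    if len(set(questions)) != len(questions):
        raise ValueError("Duplicate questions found.")
    return [{"question": q, "answer": a} for q, a in zip(questions, answers)]
```

Calling `validate_qa_pairs(inputs, outputs)` before `client.create_examples(...)` would surface these problems locally rather than in the evaluation results.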


## LangSmith RAG Evaluators

### Type 1: Answer accuracy

First, let's consider the case in which we want to compare our RAG chain answer to a reference answer.


#### Evaluation flow

We will use an LLM as judge with a customized grader prompt:

https://smith.langchain.com/hub/langchain-ai/rag-answer-vs-reference

![langsmith_rag_flow.png](images/langsmith_rag_flow.png)

In [13]:
# RAG chain evaluation for Elden Ring-specific questions
def predict_rag_answer(example: dict):
    """
    Use this function to predict answers to questions related to Elden Ring's lore,
    characters, mechanics, or gameplay for evaluation purposes.
    """
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_with_context(example: dict):
    """
    Use this function for detailed evaluation, including retrieved context and checking for hallucinations.
    This is especially useful for verifying if the response includes relevant Elden Ring context.
    """
    response = rag_bot.get_answer(example["question"])
    return {
        "answer": response["answer"],
        "contexts": response["contexts"] 
    }

In [14]:
# Pull the reference grading prompt from the hub. The evaluator below uses a hand-written,
# Elden Ring-specific prompt instead, so this is kept for comparison only.
grade_prompt_answer_accuracy = hub.pull("langchain-ai/rag-answer-vs-reference")

def answer_evaluator(run, example) -> dict:
    """
    A simple evaluator for RAG answer accuracy.
    """
    input_question = example.inputs["question"]
    reference = example.outputs["answer"]
    prediction = run.outputs["answer"]

    llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

    # Create a formatted input as a string
    prompt_input = (
        f"Evaluate the student's answer based on its accuracy:\n"
        f"Question: {input_question}\n"
        f"Correct Answer: {reference}\n"
        f"Student Answer: {prediction}\n"
        f"Provide a numerical score between 0 and 100 based on accuracy.\n"
        f"Output the score in the following format: 'Accuracy Score: [number]'."
    )

    # Use HumanMessage for structured input
    response = llm.invoke([HumanMessage(content=prompt_input)])
    score_content = response.content  # Access the generated response text

    # Extract the score using a refined regex pattern
    match = re.search(r'Accuracy Score:\s*(\d+)', score_content)
    if match:
        score = int(match.group(1))  # Extract the score after 'Accuracy Score:'
    else:
        raise ValueError("Score could not be extracted from LLM response.")

    print(f'question={input_question}, score={score}')
    return {"key": "answer_score", "score": score}
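
The evaluator above raises an error whenever the judge's reply does not match the expected format exactly. As an alternative sketch, the parser below tolerates case differences and decimals, clamps the value to the 0 to 100 range, and returns `None` instead of raising. Rescaling to the 0 to 1 range is a design choice made here for illustration, and `parse_accuracy_score` is a hypothetical helper.

```python
import re
from typing import Optional

# Matches e.g. "Accuracy Score: 85" or "accuracy score:92.5"
SCORE_PATTERN = re.compile(r"Accuracy Score:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_accuracy_score(text: str) -> Optional[float]:
    """Extract the judge's score and rescale it to [0, 1]; None if not found."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    value = max(0.0, min(100.0, value))  # clamp out-of-range replies
    return value / 100.0
```

Returning `None` lets the caller decide whether to skip the example, retry the judge, or record the failure, instead of aborting the whole experiment.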

In [15]:
# Ensure the dataset name matches your Elden Ring-specific dataset
dataset_name = "Elden_Ring_evaluation_3"

# Run the evaluation experiment
experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="elde_bot_rag_accuracy_3",
    metadata={"variant": "Elden Ring context, gpt-4-0125-preview"},
)

print("Evaluation complete. Results are stored in the experiment.")

View the evaluation results for experiment: 'elde_bot_rag_accuracy_3-5d12a156' at:
https://smith.langchain.com/o/8149cca6-21e4-4567-9557-d316bd677644/datasets/c1f83cf2-8ed7-4b15-ab0b-5e5d5b6b1c50/compare?selectedSessions=4606558e-13d2-4a90-827a-0e6619520238






question=What is the origin of the Erdtree and why is it important?, score=85
question=Who are the Two Fingers and what role do they play in the Golden Order?, score=95


3it [00:18,  4.98s/it]

question=What role does the Tarnished play in the grand scheme of the Elden Ring story?, score=95
question=What are the Outer Gods, and how do they influence the world of Elden Ring?, score=85
question=Who is Radagon, and what connection does he have with Queen Marika?, score=85


4it [00:22,  4.54s/it]

question=What is the significance of the Rune of Death in Elden Ring's lore?, score=85


7it [00:24,  2.18s/it]

question=What is the Greater Will and how did it shape the Lands Between?, score=85


8it [00:25,  3.17s/it]

question=Who is Queen Marika and how did she change the fate of the Lands Between?, score=85
Evaluation complete. Results are stored in the experiment.
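
The per-question scores printed above can be summarized with a few lines of standard-library Python. The dictionary below simply copies the scores from the output of this run; the variable names are illustrative.

```python
from statistics import mean

# Scores copied from the printed output of the accuracy experiment above.
accuracy_scores = {
    "What is the origin of the Erdtree and why is it important?": 85,
    "Who are the Two Fingers and what role do they play in the Golden Order?": 95,
    "What role does the Tarnished play in the grand scheme of the Elden Ring story?": 95,
    "What are the Outer Gods, and how do they influence the world of Elden Ring?": 85,
    "Who is Radagon, and what connection does he have with Queen Marika?": 85,
    "What is the significance of the Rune of Death in Elden Ring's lore?": 85,
    "What is the Greater Will and how did it shape the Lands Between?": 85,
    "Who is Queen Marika and how did she change the fate of the Lands Between?": 85,
}

average_accuracy = mean(accuracy_scores.values())
print(f"Average accuracy score: {average_accuracy}")  # prints 87.5
```

With only eight examples and scores clustered at 85 and 95, the average is best read as a rough indicator rather than a precise measurement.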





### Type 2: Answer Hallucination

#### Eval flow

We simply use an LLM-as-judge with an easily customized grader prompt: 

https://smith.langchain.com/hub/langchain-ai/rag-answer-hallucination

![langsmith_rag_flow_hallucination.png](images/langsmith_rag_flow_hallucination.png)

In [16]:
# Pull the prompt for evaluating hallucinations
grade_prompt_hallucinations = hub.pull("langchain-ai/rag-answer-hallucination")

def answer_hallucination_evaluator(run, example) -> dict:
    """
    Evaluator for detecting hallucinations in answers related to Elden Ring.
    """
    input_question = example.inputs["question"]
    contexts = run.outputs["contexts"]
    prediction = run.outputs["answer"]

    llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

    # Create prompt for hallucination evaluation
    prompt_input = (
        f"Evaluate if the student's answer is fully supported by the provided documents:\n"
        f"Question: {input_question}\n"
        f"Student Answer: {prediction}\n"
        f"Relevant Contexts:\n{contexts}\n"
        f"Indicate whether the answer includes hallucinations or unsupported content and provide a score, using these specifications:\n"
        f"- hallucination or unsupported content = 1: hallucination detected\n"
        f"- hallucination or unsupported content = 0: no hallucination detected\n"
        f"- Score: 0 for a bad answer, 3 for an okay answer, 5 for an excellent answer\n"
        f"Format your response as: 'Hallucination: [0 or 1], Score: [0, 3, or 5]'."
    )

    # Invoke LLM and parse response
    response = llm.invoke([HumanMessage(content=prompt_input)])
    hallucination_content = response.content

    # Extract hallucination and score using regex
    match = re.search(r'Hallucination:\s*(\d),\s*Score:\s*(\d)', hallucination_content)
    if match:
        hallucination_flag = int(match.group(1))
        score = int(match.group(2))
    else:
        raise ValueError("Hallucination or score could not be extracted from LLM response.")

    return {"key": "answer_hallucination", "hallucination": hallucination_flag, "score": score}
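
The regex in the evaluator above expects the exact form `Hallucination: N, Score: N`. A judge model may instead echo the square brackets from the instructions or put the two fields on separate lines. The sketch below is one more tolerant way to parse the reply; `parse_hallucination` is a hypothetical helper and the accepted variants are assumptions about what the judge might return.

```python
import re

# Accepts optional brackets, an optional comma, and any whitespace between the fields.
HALLUCINATION_PATTERN = re.compile(
    r"Hallucination:\s*\[?([01])\]?\s*,?\s*Score:\s*\[?([035])(?!\d)\]?",
    re.IGNORECASE,
)


def parse_hallucination(text):
    """Return {'hallucination': 0 or 1, 'score': 0, 3 or 5}, or None if not found."""
    match = HALLUCINATION_PATTERN.search(text)
    if match is None:
        return None
    return {"hallucination": int(match.group(1)), "score": int(match.group(2))}
```

The `(?!\d)` lookahead rejects replies such as `Score: 50`, which would otherwise be read as a score of 5.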

In [17]:
dataset_name = "Elden_Ring_evaluation_3"

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[answer_hallucination_evaluator],
    experiment_prefix="elden_bot_rag_hallucination_3",
    metadata={
        "variant": "Elden Ring context, gpt-4-0125-preview",  # Adjust as needed for model specificity
    },
)

print("Evaluation for Elden Ring answer hallucination detection completed.")

View the evaluation results for experiment: 'elden_bot_rag_hallucination_3-50809ead' at:
https://smith.langchain.com/o/8149cca6-21e4-4567-9557-d316bd677644/datasets/c1f83cf2-8ed7-4b15-ab0b-5e5d5b6b1c50/compare?selectedSessions=06f2256e-e001-47ac-83fe-4301ed33d1a1




8it [00:32,  4.09s/it]

Evaluation for Elden Ring answer hallucination detection completed.





### Type 3: Document Relevance to Question

#### Eval flow

We simply use an LLM-as-judge with an easily customized grader prompt: 

https://smith.langchain.com/hub/langchain-ai/rag-document-relevance

![langsmith_rag_flow_doc_relevance.png](images/langsmith_rag_flow_doc_relevance.png)

In [18]:
# Pull the document relevance grading prompt
grade_prompt_doc_relevance = hub.pull("langchain-ai/rag-document-relevance")

def docs_relevance_evaluator(run, example) -> dict:
    """
    Evaluator for checking document relevance for Elden Ring questions and context.
    """
    input_question = example.inputs["question"]
    contexts = run.outputs["contexts"]
    prediction = run.outputs["answer"]

    llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

    # Create a formatted prompt for document relevance evaluation
    prompt_input = (
        f"Evaluate the relevance of the provided documents for the given question:\n"
        f"Question: {input_question}\n"
        f"Documents:\n{contexts}\n"
        f"Student Answer: {prediction}\n"
        f"Provide a score between 0 and 100 for how relevant the documents are to the question and answer."
        f" Format your response as: 'Score: [number]'."
    )

    # Use HumanMessage for structured input
    response = llm.invoke([HumanMessage(content=prompt_input)])
    relevance_content = response.content  # Access the response text

    # Extract the score using regex or interpret response as needed
    match = re.search(r'Score:\s*(\d+)', relevance_content)
    if match:
        score = int(match.group(1))
    else:
        raise ValueError("Relevance score could not be extracted from LLM response.")

    print(f'question={input_question}, score={score}')
    return {"key": "document_relevance", "score": score}

In [None]:
dataset_name = "Elden_Ring_evaluation_3"

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[docs_relevance_evaluator],
    experiment_prefix="elden_bot_rag_relevance_3",
    metadata={
        "variant": "Elden Ring context, gpt-4-0125-preview",
    },
)

print("Document relevance evaluation for Elden Ring completed successfully.")

View the evaluation results for experiment: 'elden_bot_rag_relevance_3-1da035a5' at:
https://smith.langchain.com/o/8149cca6-21e4-4567-9557-d316bd677644/datasets/c1f83cf2-8ed7-4b15-ab0b-5e5d5b6b1c50/compare?selectedSessions=91eb8a2d-e994-406d-bc85-6f3682674ab0




0it [00:00, ?it/s]

question=Who are the Two Fingers and what role do they play in the Golden Order?, score=85


1it [00:17, 17.45s/it]

question=What are the Outer Gods, and how do they influence the world of Elden Ring?, score=95


2it [00:18,  7.83s/it]

question=Who is Radagon, and what connection does he have with Queen Marika?, score=95


4it [00:19,  3.06s/it]

question=What is the Greater Will and how did it shape the Lands Between?, score=95


5it [00:20,  2.14s/it]

question=What is the significance of the Rune of Death in Elden Ring's lore?, score=95


6it [00:21,  1.75s/it]

question=What is the origin of the Erdtree and why is it important?, score=85


8it [00:24,  3.08s/it]

question=Who is Queen Marika and how did she change the fate of the Lands Between?, score=85
question=What role does the Tarnished play in the grand scheme of the Elden Ring story?, score=0
Document relevance evaluation for Elden Ring completed successfully.
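
One question in the output above received a relevance score of 0 while the rest scored 85 or 95, which is worth inspecting in the LangSmith trace before drawing conclusions. The sketch below flags such outliers; the scores are copied from this run, and `flag_low_scores` and its threshold of 50 are illustrative assumptions.

```python
from statistics import mean, median

# Scores copied from the printed output of the document relevance experiment above.
relevance_scores = {
    "Who are the Two Fingers and what role do they play in the Golden Order?": 85,
    "What are the Outer Gods, and how do they influence the world of Elden Ring?": 95,
    "Who is Radagon, and what connection does he have with Queen Marika?": 95,
    "What is the Greater Will and how did it shape the Lands Between?": 95,
    "What is the significance of the Rune of Death in Elden Ring's lore?": 95,
    "What is the origin of the Erdtree and why is it important?": 85,
    "Who is Queen Marika and how did she change the fate of the Lands Between?": 85,
    "What role does the Tarnished play in the grand scheme of the Elden Ring story?": 0,
}


def flag_low_scores(scores, threshold=50):
    """Return the questions whose score falls below the threshold."""
    return [question for question, score in scores.items() if score < threshold]


print("Mean:", mean(relevance_scores.values()))
print("Median:", median(relevance_scores.values()))
print("Needs review:", flag_low_scores(relevance_scores))
```

Reporting the median alongside the mean keeps a single outlier from dominating the summary.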





You can find more LangSmith evaluation tutorials in the [official documentation](https://docs.smith.langchain.com/evaluation/tutorials).