# Building the Entire RAG Ecosystem and Optimizing Every Component

![alt text](architecture.png)

1. **Query Transformations:** Rewriting user questions to be more effective for retrieval.
2. **Intelligent Routing:** Directing a query to the correct data source or a specialized tool.
3. **Indexing:** Creating a multi-layered knowledge base.
4. **Retrieval and Re-ranking:** Filtering noise and prioritizing the most relevant context.
5. **Self-Correcting Agentic Flows:** Building systems that can grade and improve their own work.
6. **End-to-End Evaluation:** Objectively measuring the performance of the entire pipeline.

![ale text](simplerag.png)

* **Indexing:** Organize and store data in a structured format to enable efficient searching.
* **Retrieval:** Search and fetch relevant data based on a query or input.
* **Generation:** Create a final response or output using the retrieved data.

### Indexing Phase

![alt text](indexing.png)

In [103]:
from dotenv import load_dotenv
import os
load_dotenv()

True

In [9]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Initialize a web document loader with specific parsing instructions
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-header", "post-title")
        )
    ),
)

docs = loader.load()

We need to break the document into smaller, semantically meaningful pieces.

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)

In [14]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()
)

### **Retrieval**

The vector store is our library, and the retriever is our smart librarian. It takes a user’s query, embeds it, and then fetches the most semantically similar chunks from the vector store.

![alt text](retrieval.png)

In [20]:
retriever = vectorstore.as_retriever()

In [21]:
docs = retriever.get_relevant_documents("What is Task Decomposition?")

print(docs[0].page_content)

  docs = retriever.get_relevant_documents("What is Task Decomposition?")


Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.


### Generator

LLM to read it and formulate a human-friendly answer

![alt text](generator.png)

In [22]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})]


In [23]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=1)

In [28]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
)

In [30]:
response = rag_chain.invoke("What is Task Decomposition?")
print(response)

Task Decomposition is the process of breaking down a complicated task into smaller, manageable steps. This method often involves techniques like Chain of Thought (CoT) and Tree of Thoughts, which enhance reasoning and planning capabilities. It can be implemented through simple prompts, task-specific instructions, or external planning tools.


## Advanced Query Transformations

![](advancedquery.png)

A query might be too specific, too broad, or use different vocabulary than our source documents, leading to poor retrieval results.

**Query Transformation** is a set of powerful techniques designed to re-write, expand, or break down the original question to significantly improve retrieval accuracy.

## Multi-Query Generation

A single user query represents just one perspective. Distance-based similarity search might miss relevant documents that use synonyms or discuss related concepts.

The Multi-Query approach tackles this by using an LLM to generate several different versions of the user’s question, effectively searching from multiple angles.

![](multiquery.png)

In [31]:
from langchain.prompts import ChatPromptTemplate

template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

generate_queries = (
    prompt_perspectives | ChatOpenAI(temperature=0) | StrOutputParser() | (lambda x: x.split("\n"))
)

In [33]:
question = "What is task decomposition for LLM agents?"
generated_queries_list = generate_queries.invoke({"question": question})

for i, q in enumerate(generated_queries_list):
    print(f"{i+1}. {q}")

1. 1. How do LLM agents utilize task decomposition in their operations?
2. 2. Can you explain the concept of task decomposition as applied to LLM agents?
3. 3. In what ways do LLM agents benefit from task decomposition?
4. 4. What role does task decomposition play in the functioning of LLM agents?
5. 5. How is task decomposition integrated into the workflow of LLM agents?


Now, we can retrieve documents for all of these queries and combine the results. A simple way to combine them is to take the unique set of all retrieved documents.

In [35]:
from langchain.load import dumps, loads

def get_unique_union(documents: list[list]):
    flattened_docs = [dumps(docs) for sublist in documents for doc in sublist]
    unique_docs = list(set(flattened_docs))
    return [loads(doc) for doc in unique_docs]

retrieval_chain = generate_queries | retriever.map() | get_unique_union

docs = retrieval_chain.invoke({"question": question})

print(f"Total unique documents retrieved: {len(docs)}")

Total unique documents retrieved: 1


In [36]:
from operator import itemgetter

template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain, "question": itemgetter("question")} | prompt | llm | StrOutputParser()
)

final_rag_chain.invoke({"question": question})

'Task decomposition for LLM (Large Language Model) agents involves breaking down complicated tasks into smaller, manageable steps to enhance performance and aid in planning. This can be achieved through various methods:\n\n1. **Chain of Thought (CoT)**: A prompting technique where the model is encouraged to "think step by step," allowing it to decompose hard tasks into simpler components.\n2. **Tree of Thoughts**: An extension of CoT that decomposes a problem into multiple thought steps, generating several thoughts for each step, which creates a tree structure. This approach allows for exploring multiple reasoning possibilities and can utilize search processes like breadth-first search (BFS) or depth-first search (DFS), with evaluations made by a classifier or through majority vote.\n3. **Direct Prompts**: Simple prompting types, such as asking for subgoals or outlining steps for a specific task.\n4. **Task-Specific Instructions**: Providing specific instructions tailored to the task, 

This answer is more robust because it’s based on a wider pool of relevant documents.



## RAG-Fusion

RAG-Fusion improves on Multi-Query by not just fetching documents, but also re-ranking them using a technique called Reciprocal Rank Fusion (RRF).

RRF intelligently combines results from multiple searches. It boosts the score of documents that appear consistently high across different result lists, pushing the most relevant content to the top.

![](ragfusion.png)

In [40]:
def reciprocal_rank_fusion(results: list[list], k=60):
    fused_scores = {}
    
    for docs in results:
        for rank, doc in enumerate(docs):
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            fused_scores[doc_str] += 1 / (rank + k)
            
    reranked_results = [
        (loads(doc), score) for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results

In [41]:
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

generate_queries = (
    prompt_rag_fusion 
    | ChatOpenAI(temperature=0)
    | StrOutputParser() 
    | (lambda x: x.split("\n"))
)

# Build the new retrieval chain with RRF
retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})

print(f"Total re-ranked documents retrieved: {len(docs)}")


Total re-ranked documents retrieved: 5


## Decomposition

The Decomposition technique uses an LLM to break down a complex query into a set of simpler, self-contained sub-questions. We can then answer each one and synthesize a final answer.

![](decomposition1.png)
---
![](decomposition2.png)

In [43]:
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answers in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)

generate_queries_decomposition = (
    prompt_decomposition | llm | StrOutputParser() | (lambda x: x.split("\n"))
)

question = "What are the main components of an LLM-powered autonomous agent system?"
sub_questions = generate_queries_decomposition.invoke({"question": question})
print(sub_questions)

['1. What are the core functionalities of an LLM (Large Language Model) in an autonomous agent system?', '2. How does natural language understanding contribute to the efficiency of LLM-powered autonomous agents?', '3. What role does reinforcement learning play in the development of LLM-powered autonomous agent systems?']


In [44]:
prompt_rag = hub.pull("rlm/rag-prompt")

rag_results = []
for sub_question in sub_questions:
    retrieved_docs = retriever.get_relevant_documents(sub_question)
    answer = (prompt_rag|llm|StrOutputParser()).invoke({"context": retrieved_docs, "question": sub_question})
    rag_results.append(answer)

In [45]:
def format_qa_pairs(questions, answers):
    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question} \n Answer {i}: {answer} \n\n"
    return formatted_string.strip()

In [46]:
context = format_qa_pairs(sub_questions, rag_results)

# Final synthesis prompt
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the original question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context": context, "question": question})


'An LLM-powered autonomous agent system is composed of several key components that work together to enhance its functionality and effectiveness. These components include:\n\n1. **Large Language Model (LLM)**: The LLM serves as the central processing unit, functioning like the “brain” of the agent. It performs tasks such as planning, where complex tasks are broken down into manageable subgoals, and reflection, allowing the agent to learn from past actions and improve its future performance.\n\n2. **Natural Language Understanding (NLU)**: This component enables effective communication between the LLM and external systems, such as memory and tools. Through NLU, the agent can comprehend and interpret natural language, allowing it to navigate complex tasks and self-reflect on previous actions for enhanced outcomes.\n\n3. **Memory Capabilities**: A critical aspect of the system, memory allows the agent to retain both short-term and long-term information. This retention is essential for infor

## Step-Back Prompting

Sometimes, a user’s query is too specific, while our documents contain the more general, underlying information needed to answer it.

![](stepback.png)

The Step-Back technique uses an LLM to take a “step back” and form a more general question. We then retrieve context for both the specific and general questions, providing a richer context for the final answer.

We can teach the LLM this pattern using few-shot examples.

In [50]:
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

examples = [
    {
        "input": "Could the members of The Police perform lawful arrests?",
        "output": "what can the members of The Police do?",
    },
    {
        "input": "Jan Sindel's was born in what country?",
        "output": "what is Jan Sindel's personal history?",
    },
]


example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai"), "{output}"
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

prompt = ChatPromptTemplate.from_messages([
    ("system", 
    "You are an expert at world knowledge. Your task is to step back and paraphrase a question "
    "to a more generic step-back question, which is easier to answer. Here are a few examples:"),
    few_shot_prompt,
    ("user", "{question}"),
])

In [51]:
# Define a chain to generate step-back questions using the prompt and an OpenAI model
generate_queries_step_back = prompt | ChatOpenAI(temperature=0) | StrOutputParser()

# Run the chain on a specific question
question = "What is task decomposition for LLM agents?"
step_back_question = generate_queries_step_back.invoke({"question": question})

# Output the original and generated step-back question
print(f"Original Question: {question}")
print(f"Step-Back Question: {step_back_question}")

Original Question: What is task decomposition for LLM agents?
Step-Back Question: What are the steps involved in breaking down tasks for LLM agents?


In [52]:
from langchain_core.runnables import RunnableLambda

# Prompt for the final response
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

# Normal Context
{normal_context}

# Step-Back Context
{step_back_context}

# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)

# The full chain
chain = (
    {
        # Retrieve context using the normal question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Retrieve context using the step-back question
        "step_back_context": generate_queries_step_back | retriever,
        # Pass on the original question
        "question": lambda x: x["question"],
    }
    | response_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

chain.invoke({"question": question})

'Task decomposition for LLM agents refers to the process of breaking down complex tasks into smaller, more manageable subgoals or steps that can be easily handled by the agent. This decomposition can be achieved through various methods, such as simple prompting, task-specific instructions, or human inputs.\n\nOne approach to task decomposition involves using simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to guide the LLM in breaking down the task into smaller components. Another method is to provide task-specific instructions, such as asking the agent to "Write a story outline" for writing a novel. Additionally, human inputs can also be used to decompose tasks for LLM agents.\n\nIn some cases, a more advanced approach known as LLM+P involves relying on an external classical planner to perform long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. The L

## HyDE

**HyDE (Hypothetical Document Embeddings)** proposes a radical solution: First, have an LLM generate a hypothetical answer to the question. This fake document, while not factually correct, will be semantically rich and use the kind of language we expect to find in a real answer.

We then embed this hypothetical document and use its embedding to perform the retrieval. The result is that we find real documents that are semantically very similar to an ideal answer.

![](hyde.png)

In [56]:
template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)

generate_docs_for_retrieval = (
    prompt_hyde | llm | StrOutputParser()
)

hypothetical_document = generate_docs_for_retrieval.invoke({"question": question})

print(hypothetical_document)

**Task Decomposition for LLM Agents: An Overview**

Task decomposition is a critical process in the effective utilization of Large Language Model (LLM) agents, which refers to the systematic breakdown of a complex task into smaller, manageable subtasks. This approach is essential in enhancing the efficiency and performance of LLM agents when tackling multifaceted problems that require sequential reasoning and problem-solving abilities.

In the context of LLM agents, task decomposition often involves identifying key components of a larger objective, categorizing them into discrete actions or steps, and establishing relationships between these components. For instance, when faced with the task of generating a comprehensive report, the LLM agent can decompose this task into several smaller subtasks: conducting research, outlining the report structure, drafting individual sections, and finally, revising the content for clarity and coherence.

The methodology of task decomposition includes 

In [57]:
retrieval_chain = generate_docs_for_retrieval | retriever
retriever_docs = retrieval_chain.invoke({"question": question})
final_rag_chain.invoke({"context": retrieved_docs, "question": question})

'Task decomposition for LLM (large language model) agents refers to the process by which an agent breaks down large, complex tasks into smaller, more manageable subgoals. This allows the agent to handle intricate tasks more efficiently and effectively. By decomposing tasks, the LLM agent can focus on one subgoal at a time, making it easier to plan and execute actions toward achieving the overall objective.\n\nThis process also incorporates elements of self-reflection and refinement, where the agent can evaluate past actions and learn from mistakes, thus improving future performance. The ability to decompose tasks is essential for the agent to optimize its problem-solving capabilities and navigate real-world challenges where iterative improvement is often necessary.'

## Routing & Query Construction

We often have multiple data sources: documentation for different programming languages, internal wikis, public websites, or databases with structured metadata.

![](routing.png)

This is where our RAG system needs to evolve from a simple librarian into an intelligent switchboard operator. It needs the ability to first analyze an incoming query and then route it to the correct destination or construct a more precise, structured query for retrieval. This section dives into the techniques that make this possible.

### Logical Routing

Routing is a classification problem. Given a user’s question, we need to classify it into one of several predefined categories. While traditional ML models can do this, we can leverage the powerful reasoning engine we already have: the LLM itself.

![](logical.png)

By providing the LLM with a clear schema (a set of possible categories), we can ask it to make the classification decision for us.

In [58]:
from typing import Literal
from langchain_core.pydantic_v1 import BaseModel, Field

class RouteQuery(BaseModel):
    datasource: Literal["python_docs", "js_docs", "golang_docs"] = Field(
        ..., 
        description="Given a user question, choose which datasource would be most relevant for answering their question.",
    )

In [59]:
structured_llm = llm.with_structured_output(RouteQuery)
system = """ou are an expert at routing a user question to the appropriate data source.
Based on the programming language the question is referring to, route it to the relevant data source."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}")
    ]
)
router = prompt | structured_llm



In [60]:
question = """Why doesn't the following code work:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""

# Invoke the router and check the result
result = router.invoke({"question": question})

print(result)

datasource='python_docs'


In [61]:
def choose_route(result):
    if "python_docs" in result.datasource.lower():
        return "chain for python_docs"
    elif "js_docs" in result.datasource.lower():
        return "chain for js_docs"
    else:
        return "chin for golang_docs"

full_chain = router | RunnableLambda(choose_route)

final_destination = full_chain.invoke({"question": question})

print(final_destination)

chain for python_docs


### Semantic Routing

Logical routing works perfectly when you have clearly defined categories. But what if you want to route based on the style or domain of a question? For example, you might want to answer physics questions with a serious, academic tone and math questions with a step-by-step, pedagogical approach. This is where Semantic Routing comes in.

![](semanticrouting.png)

`Instead of classifying the query, we define multiple expert prompts.`

We then embed the user’s query and each of our prompt templates, and use cosine similarity to find the prompt that is most semantically aligned with the query.

In [62]:
from langchain_core.prompts import PromptTemplate

# A prompt for a physics expert
physics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise and easy to understand manner. \
When you don't know the answer to a question you admit that you don't know.

Here is a question:
{query}"""

# A prompt for a math expert
math_template = """You are a very good mathematician. You are great at answering math questions. \
You are so good because you are able to break down hard problems into their component parts, \
answer the component parts, and then put them together to answer the broader question.

Here is a question:
{query}"""

In [73]:
from langchain.utils.math import cosine_similarity

embeddings = OpenAIEmbeddings()

prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

def prompt_router(input):
    query_embedding = embeddings.embed_query(input["query"])
    print("Query Embedding: "+ str(query_embedding))
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    print(cosine_similarity([query_embedding], prompt_embeddings))
    print("Similarity: " + str(similarity))
    most_similar_index = similarity.argmax()
    chosen_prompt = prompt_templates[most_similar_index]
    print(f"DEBUG: Using {'MATH' if most_similar_index == 1 else 'PHYSICS'} template.")
    return PromptTemplate.from_template(chosen_prompt)

In [74]:
# The final chain that combines the router with the LLM
chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)  # Dynamically select the prompt
    | ChatOpenAI()
    | StrOutputParser()
)

# Ask a physics question
print(chain.invoke("What's a black hole"))

Query Embedding: [-0.01060719694942236, -0.016601404175162315, -0.004036366939544678, 0.0006725881830789149, -0.039719998836517334, -0.0138792023062706, -0.011069837026298046, -0.028589816763997078, -0.002549549099057913, -0.035643402487039566, 0.04722951725125313, -0.0029116154182702303, -0.002331638941541314, -0.00944053940474987, -0.0017248429358005524, 0.018746981397271156, 0.03481198847293854, -0.020490262657403946, 0.007898406125605106, -0.01298744697123766, -0.025653056800365448, 0.019484523683786392, 0.0027138199657201767, -0.008233652450144291, -0.012471167370676994, 0.00789170153439045, 0.007529634982347488, -0.01356407068669796, -0.012129216454923153, -0.010560262948274612, 0.01473743375390768, -0.011157001368701458, -0.019927049055695534, -0.02617604099214077, -0.028267979621887207, 0.021630100905895233, -0.009916589595377445, -0.01483130268752575, 0.002247827360406518, 0.016829371452331543, 0.010399344377219677, 0.019980687648057938, -0.008173308335244656, -0.0009663478704

### Query Structuring

So far, we’ve focused on retrieving from unstructured text. But most real-world data is semi-structured; it contains valuable metadata like dates, authors, view counts, or categories. A simple vector search can’t leverage this information.

`Query Structuring is the technique of converting a natural language question into a structured query that can use these metadata filters for highly precise retrieval.`

In [None]:
from langchain_community.document_loaders import YoutubeLoader

docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ", add_video_info=True
).load()


print(docs[0].metadata)

In [82]:
import datetime
from typing import Optional

class TutorialSearch(BaseModel):
    """A data model for searching over a database of tutorial videos."""

    # The main query for a similarity search over the video's transcript.
    content_search: str = Field(..., description="Similarity search query applied to video transcripts.")
    
    # A more succinct query for searching just the video's title.
    title_search: str = Field(..., description="Alternate version of the content search query to apply to video titles.")
    
    # Optional metadata filters
    min_view_count: Optional[int] = Field(None, description="Minimum view count filter, inclusive.")
    max_view_count: Optional[int] = Field(None, description="Maximum view count filter, exclusive.")
    earliest_publish_date: Optional[datetime.date] = Field(None, description="Earliest publish date filter, inclusive.")
    latest_publish_date: Optional[datetime.date] = Field(None, description="Latest publish date filter, exclusive.")
    min_length_sec: Optional[int] = Field(None, description="Minimum video length in seconds, inclusive.")
    max_length_sec: Optional[int] = Field(None, description="Maximum video length in seconds, exclusive.")

    def pretty_print(self) -> None:
        """A helper function to print the populated fields of the model."""
        for field in self.__fields__:
            if getattr(self, field) is not None:
                print(f"{field}: {getattr(self, field)}")

In [83]:
# System prompt for the query analyzer
system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
structured_llm = llm.with_structured_output(TutorialSearch)

# The final query analyzer chain
query_analyzer = prompt | structured_llm



In [85]:
# Test 1: A simple query
query_analyzer.invoke({"question": "rag from scratch"}).pretty_print()

content_search: rag from scratch
title_search: rag from scratch


In [86]:
query_analyzer.invoke(
    {"question": "videos on chat langchain published in 2023"}
).pretty_print()

content_search: chat langchain
title_search: chat langchain
earliest_publish_date: 2023-01-01
latest_publish_date: 2024-01-01


In [87]:
query_analyzer.invoke(
    {
        "question": "how to use multi-modal models in an agent, only videos under 5 minutes"
    }
).pretty_print()

content_search: multi-modal models in an agent
title_search: multi-modal models in an agent
max_length_sec: 300


## Advanced Indexing Strategies
So far, our approach to indexing has been straightforward: split documents into chunks and embed them. This works, but it has a fundamental limitation.

Small, focused chunks are great for retrieval accuracy (they contain less noise), but they often lack the broader context needed for the LLM to generate a comprehensive answer.

![](advanced_indexiing.png)

Conversely, large chunks provide great context but perform poorly in retrieval because their core meaning gets diluted.

`This is the classic “chunk size” dilemma.`

### Multi-Representation Indexing

The core idea of Multi-Representation Indexing is simple but powerful: instead of embedding the full document chunks, we create a smaller, more focused representation of each chunk (like a summary) and embed that instead.

![](multi-representation.png)

During retrieval, we search over these concise summaries. Once we find the best summary, we use its ID to look up and retrieve the full, original document chunk.

This way, we get the precision of searching over small, dense summaries and the rich context of the larger parent documents for generation.

In [88]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

print(f"Loaded {len(docs)} documents.")

Loaded 2 documents.


Next, we’ll create a chain to generate a summary for each of these documents.

In [91]:
import uuid

summary_chain = (
    # Extract the page content from the document object
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

summaries = summary_chain.batch(docs, {"max_concurrency": 5}) # Use .batch() to run the summarization in parallel for efficiency

print(summaries[0])

The document "LLM Powered Autonomous Agents" by Lilian Weng provides an in-depth exploration of autonomous agents powered by large language models (LLMs). It outlines key components of such systems, including planning, memory, and tool use:

1. **Agent System Overview**: LLMs are portrayed as the brain of autonomous agents, enabling them to perform complex tasks through various functionalities.

2. **Planning**: This involves breaking down large tasks into manageable subtasks (task decomposition) and allowing the agent to self-reflect and learn from past actions to improve future performance.

3. **Memory**: The document describes two types of memory—short-term (in-context learning) and long-term (stored information retrieved via vector databases). Maximum Inner Product Search (MIPS) serves as a method for efficient retrieval of stored data.

4. **Tool Use**: It highlights the ability of LLMs to interact with external APIs and tools to extend their capabilities, integrating modules tha

Now comes the crucial part. We need a `MultiVectorRetriever` which requires two main components:

A `vectorstore` to store the embeddings of our summaries.
A `docstore` (a simple key-value store) to hold the original, full documents.

In [93]:
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

# The vectorstore to index the summary embeddings
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id" # This key will link summaries to their parent documents

# The retriever that orchestrates the whole process
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate unique IDs for each of our original documents
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Create new Document objects for the summaries, adding the 'doc_id' to their metadata
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summaries to the vectorstore
retriever.vectorstore.add_documents(summary_docs)

# Add the original documents to the docstore, linking them by the same IDs
retriever.docstore.mset(list(zip(doc_ids, docs)))

  vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())


In [94]:
query = "Memory in agents"

# First, let's see what the vectorstore finds by searching the summaries
sub_docs = vectorstore.similarity_search(query, k=1)
print("--- Result from searching summaries ---")
print(sub_docs[0].page_content)
print("\n--- Metadata showing the link to the parent document ---")
print(sub_docs[0].metadata)

--- Result from searching summaries ---
The document "LLM Powered Autonomous Agents" by Lilian Weng provides an in-depth exploration of autonomous agents powered by large language models (LLMs). It outlines key components of such systems, including planning, memory, and tool use:

1. **Agent System Overview**: LLMs are portrayed as the brain of autonomous agents, enabling them to perform complex tasks through various functionalities.

2. **Planning**: This involves breaking down large tasks into manageable subtasks (task decomposition) and allowing the agent to self-reflect and learn from past actions to improve future performance.

3. **Memory**: The document describes two types of memory—short-term (in-context learning) and long-term (stored information retrieved via vector databases). Maximum Inner Product Search (MIPS) serves as a method for efficient retrieval of stored data.

4. **Tool Use**: It highlights the ability of LLMs to interact with external APIs and tools to extend the

In [95]:
# Let the full retriever do its job
retrieved_docs = retriever.get_relevant_documents(query, n_results=1)

# Print the beginning of the retrieved full document
print("\n--- The full document retrieved by the MultiVectorRetriever ---")
print(retrieved_docs[0].page_content[0:500])


--- The full document retrieved by the MultiVectorRetriever ---






LLM Powered Autonomous Agents | Lil'Log







































Lil'Log

















|






Posts




Archive




Search




Tags




FAQ









      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


 


Table of Contents



Agent System Overview

Component One: Planning

Task Decomposition

Self-Reflection


Component Two: Memory

Types of Memory

Maximum Inner Product Search (MIPS)


Component Three:


### **Hierarchical Indexing (RAPTOR) Knowledge Tree**

**The Theory:** RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) takes the multi-representation idea a step further. Instead of just one layer of summaries, RAPTOR builds a multi-level tree of summaries. It starts by clustering small document chunks. It then summarizes each cluster.

![](raptor.png)

Then, it takes these summaries, clusters them, and summarizes the new clusters. This process repeats, creating a hierarchy of knowledge from fine-grained details to high-level concepts. When you query, you can search at different levels of this tree, allowing for retrieval that can be as specific or as general as needed.

This is a more advanced technique.

Implementation: https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb 

Paper: https://arxiv.org/pdf/2401.18059

### **Token-Level Precision (ColBERT)**

**The Theory:** Standard embedding models create a single vector for an entire chunk of text (this is called a “bag-of-words” approach). This can lose a lot of nuance.

![](colbert.png)

`ColBERT (Contextualized Late Interaction over BERT) offers a more granular approach. It generates a separate, context-aware embedding for every single token in the document.`

When you make a query, ColBERT also embeds every token in your query. Then, instead of comparing one document vector to one query vector, it finds the maximum similarity between each query token and any document token.

This “late interaction” allows for a much finer-grained understanding of relevance, excelling at keyword-style searches.

### ColBERT Late Interaction

**ColBERT** makes search smarter by breaking your query and each document into their individual words (tokens), then giving each word its own vector (embedding) based on context.[1][4]

Instead of checking if the entire query is similar to the entire document (like most models do), ColBERT matches **each word in your query** to **every word in the document**. For each query word, it finds the one document word that is most similar, keeping the highest score—this is the 'late interaction' step.[4][1]

**Why is this better?**
- ColBERT can spot relevant details, even if words are phrased differently in document and query.
- It does a keyword-style search, but with context-aware vectors, so matches are smarter and more precise.[3]
- It adds up the best matches for every query word, giving a fine-grained relevance score for each document.[4]

**Summary:**
- Every word in your query is compared to every word in each document.
- ColBERT finds the best matching word for each query word ("max similarity").
- These scores are summed up for a detailed relevance score.
- This lets ColBERT find relevant results even for tricky, keyword-heavy searches.

If you're used to models that give each document a single vector, ColBERT goes much deeper by making lots of smaller, meaningful comparisons.


In [None]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

In [None]:
import requests

def get_wikipedia_page(title: str):
    """A helper function to retrieve content from Wikipedia."""
    # Wikipedia API endpoint and parameters
    URL = "https://en.wikipedia.org/w/api.php"
    params = { "action": "query", "format": "json", "titles": title, "prop": "extracts", "explaintext": True }
    headers = {"User-Agent": "MyRAGApp/1.0"}
    response = requests.get(URL, params=params, headers=headers)
    data = response.json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")

full_document = get_wikipedia_page("Hayao_Miyazaki")

# Index the document with RAGatouille. It handles the chunking and token-level embedding internally.
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-ColBERT",
    max_document_length=180,
    split_documents=True,
)

In [None]:
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
print(results)

In [None]:
# Convert the RAGatouille model into a LangChain-compatible retriever
colbert_retriever = RAG.as_langchain_retriever(k=3)

# Use it like any other retriever
retrieved_docs = colbert_retriever.invoke("What animation studio did Miyazaki found?")
print(retrieved_docs[0].page_content)

## Advanced Retrieval & Generation

![](advanced-retrieval-generation.png)

### Dedicated Re-ranking

Standard retrieval methods give us a ranked list of documents, but this initial ranking isn’t always perfect. Re-ranking is a crucial second-pass step where we take the initial set of retrieved documents and use a more sophisticated (and often more expensive) model to re-order them based on their relevance to the query.

![](reranking.png)

`This ensures that the most relevant documents are placed at the very top of the context we provide to the LLM.`

Now, we introduce the `ContextualCompressionRetriever`. This special retriever wraps our base retriever and adds a "compressor" step. Here, our compressor will be the `CohereRerank` model.

It will take the 10 documents from our base retriever and re-order them, returning only the most relevant ones.

In [111]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# 1. Load documents
loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",))
blog_docs = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, chunk_overlap=50
)
splits = text_splitter.split_documents(blog_docs)

# 3. Create vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# 4. Add reranker
compressor = CohereRerank(model="rerank-english-v3.0")  # use v3.0 or v3.5
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# 5. Ask a question
question = "What is task decomposition for LLM agents?"
compressed_docs = compression_retriever.get_relevant_documents(question)

# 6. Print results
print("--- Re-ranked and Compressed Documents ---")
for doc in compressed_docs:
    print(f"Relevance Score: {doc.metadata.get('relevance_score', 0):.4f}")
    print(f"Content: {doc.page_content[:150]}...\n")


--- Re-ranked and Compressed Documents ---
Relevance Score: 0.9988
Content: Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chai...

Relevance Score: 0.9988
Content: Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chai...

Relevance Score: 0.9988
Content: Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chai...



## Self-Correction using AI Agents

What if our RAG system could check its own work before giving an answer? That’s the idea behind self-correcting RAG architectures like `CRAG (Corrective RAG)` and `Self-RAG`.

![](self-correcting.png)

These aren’t just simple chains, they are dynamic graphs (often built with LangGraph) that can reason about the quality of retrieved information and decide on a course of action.

* **CRAG:** If the retrieved documents are irrelevant or ambiguous for a given query, a CRAG system won’t just pass them to the LLM. Instead, it triggers a new, more robust web search to find better information, corrects the retrieved documents, and then proceeds with generation.
* **Self-RAG:** This approach takes it a step further. At each step, it uses an LLM to generate “reflection tokens” that critique the process. It grades the retrieved documents for relevance. If they’re not relevant, it retrieves again. Once it has good documents, it generates an answer and then grades that answer for factual consistency, ensuring it’s grounded in the source documents.

These techniques represent the state-of-the-art in building reliable, production-grade RAG. Implementing them from scratch involves building a state machine or graph.

## Impact of Long Context

![](impact-long-context.png)

## Manual RAG Evaluation

1. **Faithfulness:** Does the answer stick strictly to the provided context? A faithful answer does not invent information or use the LLM’s pre-trained knowledge to answer. This is the single most important metric for preventing hallucinations.
2. **Correctness:** Is the answer factually correct when compared to a “ground truth” or reference answer?
3. **Contextual** Relevancy: Was the context we retrieved actually relevant to the user’s question? This evaluates the performance of our retriever, not the generator.


### Building Evaluators from Scratch with LangChain

In [112]:
from langchain.prompts import PromptTemplate

# We'll use a powerful LLM like gpt-4o to act as our "judge" for reliable evaluation.
llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Define the output schema for our evaluation score to ensure consistent, structured output.
class ResultScore(BaseModel):
    score: float = Field(..., description="The score of the result, ranging from 0 to 1 where 1 is the best possible score.")

# This prompt template clearly instructs the LLM on how to score the answer's correctness.
correctness_prompt = PromptTemplate(
    input_variables=["question", "ground_truth", "generated_answer"],
    template="""
    Question: {question}
    Ground Truth: {ground_truth}
    Generated Answer: {generated_answer}

    Evaluate the correctness of the generated answer compared to the ground truth.
    Score from 0 to 1, where 1 is perfectly correct and 0 is completely incorrect.
    
    Score:
    """
)

# We build the evaluation chain by piping the prompt to the LLM with structured output.
correctness_chain = correctness_prompt | llm.with_structured_output(ResultScore)




In [113]:
def evaluate_correctness(question, ground_truth, generated_answer):
    """A helper function to run our custom correctness evaluation chain."""
    result = correctness_chain.invoke({
        "question": question, 
        "ground_truth": ground_truth, 
        "generated_answer": generated_answer
    })
    return result.score

# Test the correctness chain with a partially correct answer.
question = "What is the capital of France and Spain?"
ground_truth = "Paris and Madrid"
generated_answer = "Paris"
score = evaluate_correctness(question, ground_truth, generated_answer)

print(f"Correctness Score: {score}")

Correctness Score: 0.5


Next, let’s build an evaluator for `Faithfulness`. This is arguably more important than correctness for RAG, as it’s our primary defense against hallucination.

In [114]:
# The prompt template for faithfulness includes several examples (few-shot prompting)
# to make the instructions to the judge LLM crystal clear.
faithfulness_prompt = PromptTemplate(
    input_variables=["question","context", "generated_answer"],
    template="""
    Question: {question}
    Context: {context}
    Generated Answer: {generated_answer}

    Evaluate if the generated answer to the question can be deduced from the context.
    Score of 0 or 1, where 1 is perfectly faithful *AND CAN BE DERIVED FROM THE CONTEXT* and 0 otherwise.
    You don't mind if the answer is correct; all you care about is if the answer can be deduced from the context.
    
    [... a few examples from the notebook to guide the LLM ...]

    Example:
    Question: What is 2+2?
    Context: 4.
    Generated Answer: 4.
    In this case, the context states '4', but it does not provide information to deduce the answer to 'What is 2+2?', so the score should be 0.
    """
)

# Build the faithfulness chain using the same structured LLM.
faithfulness_chain = faithfulness_prompt | llm.with_structured_output(ResultScore)




In [115]:
def evaluate_faithfulness(question, context, generated_answer):
    """A helper function to run our custom faithfulness evaluation chain."""
    result = faithfulness_chain.invoke({
        "question": question, 
        "context": context, 
        "generated_answer": generated_answer
    })
    return result.score

# Test the faithfulness chain. The answer is correct, but is it faithful?
question = "what is 3+3?"
context = "6"
generated_answer = "6"
score = evaluate_faithfulness(question, context, generated_answer)

print(f"Faithfulness Score: {score}")

Faithfulness Score: 0.0


### Evaluation with Frameworks

![](evals.png)

### **Rapid Evaluation with `deepeval`**

In [126]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Test case 1 - correctness
test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD.",
    retrieval_context=["Madrid is the capital of Spain."]  # needed for Faithfulness
)

# Test case 2 - faithfulness (added expected_output so GEval won't break)
test_case_faithfulness = LLMTestCase(
    input="what is 3+3?",
    expected_output="6",     # ✅ added so GEval has ground truth
    actual_output="6",
    retrieval_context=["6"]
)

# Run evaluation
evaluation_results = evaluate(
    test_cases=[test_case_correctness, test_case_faithfulness],
    metrics=[
        GEval(
            name="Correctness",
            model="gpt-4o",
            criteria="correctness",
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
            ],
        ),
        FaithfulnessMetric(),
    ],
)

print(evaluation_results)


[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30m----------------------------[0m [96m  0%[0m [34m0:00:00[0m
[2K[1A[2KEvaluating 2 test case(s) in parallel [30

**Another Powerful Alternative with `grouse`**

`grouse` is another excellent open-source option, offering a similar suite of metrics but with a unique focus on allowing deep customization of the "judge" prompts. This is useful for fine-tuning evaluation criteria for a specific domain.

In [128]:
from grouse import EvaluationSample, GroundedQAEvaluator

evaluator = GroundedQAEvaluator()

unfaithful_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.",
    expected_output="The Eiffel Tower is located on the Champ de Mars in Paris, France.",  # ✅ required
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "Gustave Eiffel died in his appartment at Rue Rabelais in Paris."
    ]
)

result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]

print("Grouse Faithfulness Score (0 or 1):", result.faithfulness.faithfulness)


100%|██████████| 1/1 [00:24<00:00, 24.59s/it]

2025-10-03 14:32:00,172 - LLM Call Tracker - INFO - Cost: 0.1039$
2025-10-03 14:32:00,172 - LLM Call Tracker - INFO - Cost: 0.1039$
Grouse Faithfulness Score (0 or 1): 0



  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


**Evaluation with `RAGAS`**
While deepeval and grouse are great general-purpose evaluators, `RAGAS (Retrieval-Augmented Generation Assessment)` is a framework built specifically for evaluating RAG pipelines. It provides a comprehensive suite of metrics that measure every component of your system, from retriever to generator.

It requires four key pieces of information for each test case:

* **question:** The user's input query.
* **answer:** The final answer generated by our RAG system.
* **contexts:** The list of documents retrieved by our retriever.
* **ground_truth:** The correct, reference answer.

In [129]:
# 1. Prepare the evaluation data
questions = [
    "What is the name of the three-headed dog guarding the Sorcerer's Stone?",
    "Who gave Harry Potter his first broomstick?",
    "Which house did the Sorting Hat initially consider for Harry?",
]

# These would be the answers generated by our RAG pipeline
generated_answers = [
    "The three-headed dog is named Fluffy.",
    "Professor McGonagall gave Harry his first broomstick, a Nimbus 2000.",
    "The Sorting Hat strongly considered putting Harry in Slytherin.",
]

# The ground truth, or "perfect" answers
ground_truth_answers = [
    "Fluffy",
    "Professor McGonagall",
    "Slytherin",
]

# The context retrieved by our RAG system for each question
retrieved_documents = [
    ["A massive, three-headed dog was guarding a trapdoor. Hagrid mentioned its name was Fluffy."],
    ["First years are not allowed brooms, but Professor McGonagall, head of Gryffindor, made an exception for Harry."],
    ["The Sorting Hat muttered in Harry's ear, 'You could be great, you know, it's all here in your head, and Slytherin will help you on the way to greatness...'"],
]

In [132]:
from datasets import Dataset

# 2. Structure the data into a Hugging Face Dataset object
data_samples = {
    'question': questions,
    'answer': generated_answers,
    'contexts': retrieved_documents,
    'ground_truth': ground_truth_answers
}

dataset = Dataset.from_dict(data_samples)

In [133]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_correctness,
)

# 3. Define the metrics we want to use for evaluation
metrics = [
    faithfulness,       # How factually consistent is the answer with the context? (Prevents hallucination)
    answer_relevancy,   # How relevant is the answer to the question?
    context_recall,     # Did we retrieve all the necessary context to answer the question?
    answer_correctness, # How accurate is the answer compared to the ground truth?
]

# 4. Run the evaluation
result = evaluate(
    dataset=dataset, 
    metrics=metrics
)

# 5. Display the results in a clean table format
results_df = result.to_pandas()
print(results_df)

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Evaluating: 100%|██████████| 12/12 [00:29<00:00,  2.45s/it]


                                          user_input  \
0  What is the name of the three-headed dog guard...   
1        Who gave Harry Potter his first broomstick?   
2  Which house did the Sorting Hat initially cons...   

                                  retrieved_contexts  \
0  [A massive, three-headed dog was guarding a tr...   
1  [First years are not allowed brooms, but Profe...   
2  [The Sorting Hat muttered in Harry's ear, 'You...   

                                            response             reference  \
0              The three-headed dog is named Fluffy.                Fluffy   
1  Professor McGonagall gave Harry his first broo...  Professor McGonagall   
2  The Sorting Hat strongly considered putting Ha...             Slytherin   

   faithfulness  answer_relevancy  context_recall  answer_correctness  
0           1.0          0.942253             1.0            0.970923  
1           0.0          0.988045             1.0            0.719327  
2           0.0      