# Evaluating RAG Architectures on Benchmark Tasks


#### Introduction

If you want to compare different approaches to Q&A over documents, this notebook shows how to evaluate common RAG architectures and configurations on benchmark tasks. The goal is to make it easy to experiment with different techniques, understand their tradeoffs, and make informed decisions for your specific use case.

#### What is RAG?

LLMs have a knowledge cutoff. For them to accurately respond to user queries, they need access to relevant information. Retrieval Augmented Generation (RAG) (aka "give an LLM a search engine") is a common design pattern to address this. The key components are:

- Retriever: fetches information from a knowledge base, which can be a vector search engine, a database, or any search engine.
- Generator: synthesizes responses using a blend of learned knowledge and the retrieved information.

The overall quality of the system depends on both components.
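
To make the two roles concrete, here is a minimal, dependency-free sketch. It is not how the benchmark implements either component: the retriever is a simple keyword-overlap ranker, the generator only builds the prompt a real LLM would receive, and all names (`KNOWLEDGE_BASE`, `retrieve`, `generate`) are hypothetical.

```python
import re

# Toy knowledge base; the contents are illustrative only.
KNOWLEDGE_BASE = [
    "LCEL is a declarative way to compose chains in LangChain.",
    "Chroma is a vector store that can persist embeddings to disk.",
    "LangSmith records traces of chain runs for debugging and evaluation.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=1):
    """Retriever: rank docs by how many tokens they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def generate(query, context):
    """Generator stand-in: a real system would send this prompt to an LLM."""
    joined = "\n".join(context)
    return f"Answer the question using only this context:\n{joined}\n\nQuestion: {query}"


question = "What is LCEL?"
context = retrieve(question, KNOWLEDGE_BASE)
prompt = generate(question, context)
print(prompt)
```

If the retriever surfaces the wrong passage, even a strong generator has nothing correct to work from, which is why both components are evaluated together.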


#### Benchmark Tasks and Datasets (As of 2023/11/21)

The following datasets are currently available:

- LangChain Docs Q&A - technical questions based on the LangChain Python documentation
- Semi-structured Reports - financial questions and answers on financial PDFs containing tables and graphs

Each task comes with a labeled dataset of questions and answers. Each task also provides configurable factory functions that make it easy to customize chunking and indexing of the relevant source documents.

And with that, let's get started!

## Pre-requisites

We will install quite a few prerequisites for this example since we are comparing many techniques and models.

We will use LangSmith to capture the evaluation traces. You can create a free account at [smith.langchain.com](https://smith.langchain.com/). Once you have done so, create an API key and make it available as the `LANGCHAIN_API_KEY` environment variable, for example through the `.env` file loaded below.

In [1]:
%pip install -U --quiet langchain langsmith langchainhub langchain_benchmarks
%pip install --quiet chromadb openai huggingface pandas langchain_experimental sentence_transformers pyarrow anthropic tiktoken langchain-openai python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
from dotenv import load_dotenv

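# Load API keys (e.g. LANGCHAIN_API_KEY, OPENAI_API_KEY) from a local .env file, if present
load_dotenv()

# Alternatively, set them directly:
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# os.environ["LANGCHAIN_API_KEY"] = "..."  # Your LangSmith API key
# os.environ["OPENAI_API_KEY"] = "sk-..."  # Your OpenAI API key
# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"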

In [2]:
import uuid

# Generate a unique run ID for these experiments
run_uid = uuid.uuid4().hex[:6]

## Review Q&A tasks

The registry provides configurations to test out common architectures on curated datasets.
Below is a list of the available tasks at the time of writing.

In [3]:
from langchain_benchmarks import clone_public_dataset, registry

In [4]:
registry.filter(Type="RetrievalTask")

Name,Type,Dataset ID,Description
LangChain Docs Q&A,RetrievalTask,452ccafc-18e1-4314-885b-edd735f17b9d,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Reports,RetrievalTask,c47d9617-ab99-4d6e-a6e6-92b8daf85a7d,Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Multi-modal slide decks,RetrievalTask,40afc8e7-9d7e-44ed-8971-2cae1eb59731,This public dataset is a work-in-progress and will be extended over time.  Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer.


In [5]:
langchain_docs = registry["LangChain Docs Q&A"]
langchain_docs

0,1
Name,LangChain Docs Q&A
Type,RetrievalTask
Dataset ID,452ccafc-18e1-4314-885b-edd735f17b9d
Description,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Retriever Factories,"basic, parent-doc, hyde"
Architecture Factories,conversational-retrieval-qa
get_docs,


In [6]:
clone_public_dataset(langchain_docs.dataset_id, dataset_name=langchain_docs.name)

Dataset LangChain Docs Q&A already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/2586a6b8-a802-5f6f-b08e-ef250f997c21/datasets/1013d34f-58c9-44f4-974b-69d7c9c6b90d.


## Basic Vector Retrieval

For our first example, we will generate a single embedding for each document in the dataset,
without any chunking, index those embeddings, and then provide the resulting retriever to an LLM for inference.
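
As a rough illustration of what "one vector per document" means, the sketch below ranks whole documents by cosine similarity. It assumes a bag-of-words count vector as a stand-in for a real embedding model, and the names `embed`, `doc_vectors`, and `search` are hypothetical rather than part of the benchmark package.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy_docs = {
    "lcel": "LCEL composes runnables into chains with a pipe syntax",
    "retrievers": "A retriever returns documents relevant to a query",
}

# One vector per whole document, regardless of the document's length.
doc_vectors = {name: embed(text) for name, text in toy_docs.items()}


def search(query, k=1):
    """Return the names of the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        doc_vectors,
        key=lambda name: cosine_similarity(query_vec, doc_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


top = search("how does a retriever pick documents")
print(top)
```

Because a long document is summarized by a single vector, details buried deep in it may contribute little to the similarity score, which is one motivation for the chunking strategies explored later.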

In [6]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
retriever_factory = langchain_docs.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
# Note that this does not apply any chunking to the docs,
# which means the documents can be of arbitrary length
retriever = retriever_factory(embeddings)

0it [00:00, ?it/s]

In [7]:
# Factory for creating a conversational retrieval QA chain

chain_factory = langchain_docs.architecture_factories["conversational-retrieval-qa"]

In [8]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

chain_factory(retriever, llm=llm).invoke({"question": "what's lcel?"})

'- **LangChain Expression Language (LCEL)** is a declarative framework for composing chains effortlessly in LangChain.\n- It allows for seamless production deployment without code changes, supporting both synchronous and asynchronous operations.\n- Key features include streaming support, optimized parallel execution, retries, and access to intermediate results.\n- LCEL provides built-in input and output schemas for validation and integrates with LangSmith for observability and debugging purposes [1][2].'

In [9]:
from functools import partial
from langsmith.client import Client
from langchain_benchmarks.rag import get_eval_config

client = Client()
RAG_EVALUATION = get_eval_config()

test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"gpt-4o-mini qa-chain simple-index {run_uid}",
    project_metadata={
        "index_method": "basic",
        "embedding_model": "text-embedding-3-small",
        "llm": "gpt-4o-mini",
    },
    verbose=True,
)

View the evaluation results for project 'gpt-4o-mini qa-chain simple-index c7f362' at:
https://smith.langchain.com/o/2586a6b8-a802-5f6f-b08e-ef250f997c21/datasets/1013d34f-58c9-44f4-974b-69d7c9c6b90d/compare?selectedSessions=58552e7a-f159-4b23-b7c5-5b0064f6d862

View all tests for Dataset LangChain Docs Q&A at:
https://smith.langchain.com/o/2586a6b8-a802-5f6f-b08e-ef250f997c21/datasets/1013d34f-58c9-44f4-974b-69d7c9c6b90d
[------------------>                               ] 32/86

Chain failed for example 889610d7-372b-41e2-8e41-94610316b1d5 with inputs {'question': 'What does ReAct mean?'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 54438, Requested 8433. Please try again in 2.871s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[--------------------------->                      ] 48/86

Chain failed for example a5d83982-2a01-4f85-9413-2c8e007f7a10 with inputs {'question': 'What is a chain?'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 50718, Requested 25391. Please try again in 16.109s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[--------------------------->                      ] 49/86

Chain failed for example c54c777a-effb-450e-b918-2d0c8c8c94b4 with inputs {'question': 'how do I search and filter metadata in redis vectorstore?'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 54128, Requested 17248. Please try again in 11.376s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[-------------------------------->                 ] 57/86

Chain failed for example c823482b-7ff4-4608-9ef1-755f2cd6be7d with inputs {'question': 'I am summarizing text contained in the variable chunks with load_summarize_chain.\n\nchain = load_summarize_chain(llm, chain_type="map_reduce")\nchain.run(chunks)\nI would like to add a tag when I run the chain that langsmith will capture. How?'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 58551, Requested 10076. Please try again in 8.627s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[--------------------------------->                ] 58/86

Chain failed for example a3862471-ca59-4a2a-abc6-e68bf6b235b7 with inputs {'question': 'what method should subclasses override if they can start producing output while input is still being generated'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 58009, Requested 16734. Please try again in 14.743s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[-------------------------------------->           ] 67/86

Chain failed for example 05c63d4a-82dd-419b-93ca-fed10e31d000 with inputs {'question': 'how do i run llama on vllm'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 48390, Requested 13634. Please try again in 2.023s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[----------------------------------------->        ] 73/86

Chain failed for example 3f8cde09-979b-47db-b73f-b689dc40748f with inputs {'question': 'what does runnable.predict() mean?'}
Error Type: RateLimitError, Message: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-4IzJSGSSKpWz7EAf8b3xiI4d on tokens per min (TPM): Limit 60000, Used 53432, Requested 23084. Please try again in 16.516s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


[------------------------------------------------->] 86/86

Unnamed: 0,feedback.score_string:accuracy,feedback.embedding_cosine_distance,feedback.faithfulness,error,execution_time,run_id
count,79.0,79.0,41.0,7,86.0,86
unique,,,,7,,86
top,,,,Error code: 429 - {'error': {'message': 'Rate ...,,fe4e150f-ab80-4b62-9b81-6772b0813364
freq,,,,1,,1
mean,0.56962,0.144144,0.702439,,5.179077,
std,0.332186,0.090606,0.340946,,6.02523,
min,0.1,0.024488,0.1,,0.991064,
25%,0.2,0.074388,0.3,,2.188493,
50%,0.7,0.109492,1.0,,3.259451,
75%,0.7,0.214402,1.0,,4.905194,


In [12]:
test_run.get_aggregate_feedback()

Unnamed: 0,feedback.score_string:accuracy,feedback.embedding_cosine_distance,feedback.faithfulness,error,execution_time,run_id
count,61.0,61.0,37.0,25,86.0,86
unique,,,,25,,86
top,,,,"Error code: 400 - {'error': {'message': ""This ...",,3c7beec6-78dd-452c-96e2-d5a7e42f115e
freq,,,,1,,1
mean,0.529508,0.125739,0.740541,,8.728728,
std,0.287138,0.064135,0.316631,,8.161951,
min,0.1,0.031547,0.1,,0.818421,
25%,0.3,0.075807,0.5,,2.345414,
50%,0.5,0.112076,1.0,,5.590454,
75%,0.7,0.169307,1.0,,12.845377,


# Comparing with other indexing strategies

The index used above retrieves the raw documents based on a single vector per document. It doesn't perform any additional chunking. You can try changing the chunking parameters when generating the index.

## Customizing Chunking

The simplest change you can make to the index is to configure how you split the documents.
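
To build intuition for the `chunk_size` and `chunk_overlap` parameters, here is a simplified sketch of fixed-size character windows. It is only an approximation: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, which this hypothetical `chunk_text` helper ignores.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing some text twice.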

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def transform_docs(docs):
    splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
    yield from splitter.split_documents(docs)


# Used for the cache
transformation_name = "recursive-text-cs4k-ol200"

retriever_factory = langchain_docs.retriever_factories["basic"]

chunked_retriever = retriever_factory(
    embeddings,
    transform_docs=transform_docs,
    transformation_name=transformation_name,
    search_kwargs={"k": 4},
)

0it [00:00, ?it/s]

In [11]:
chunked_results = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, chunked_retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"gpt-4o-mini qa-chain chunked {run_uid}",
    project_metadata={
        "index_method": "basic",
        "chunk_size": 4000,
        "chunk_overlap": 200,
        "embedding_model": "text-embedding-3-small",
        "llm": "gpt-4o-mini",
    },
    verbose=True,
)

View the evaluation results for project 'gpt-4o-mini qa-chain chunked c7f362' at:
https://smith.langchain.com/o/2586a6b8-a802-5f6f-b08e-ef250f997c21/datasets/1013d34f-58c9-44f4-974b-69d7c9c6b90d/compare?selectedSessions=0eed58cc-7bd8-4d83-be61-308cf5805bf8

View all tests for Dataset LangChain Docs Q&A at:
https://smith.langchain.com/o/2586a6b8-a802-5f6f-b08e-ef250f997c21/datasets/1013d34f-58c9-44f4-974b-69d7c9c6b90d
[------------------------------------------------->] 86/86

Unnamed: 0,feedback.score_string:accuracy,feedback.embedding_cosine_distance,feedback.faithfulness,error,execution_time,run_id
count,86.0,86.0,51.0,0.0,86.0,86
unique,,,,0.0,,86
top,,,,,,8cdb410d-1a7d-48a1-9731-33db316e4820
freq,,,,,,1
mean,0.509302,0.158267,0.65098,,3.01447,
std,0.338727,0.096378,0.372759,,1.320374,
min,0.1,0.029314,0.1,,0.896467,
25%,0.1,0.077053,0.3,,2.053875,
50%,0.5,0.121169,0.7,,2.910915,
75%,0.7,0.263016,1.0,,3.781327,


In [None]:
chunked_results.get_aggregate_feedback()

## Parent Document Retriever

This indexing technique chunks documents and generates one vector per chunk.
At retrieval time, the K "most similar" chunks are fetched, then their full parent documents are returned for the LLM to reason over.

This ensures each matching chunk is surfaced in its full natural context. It can also improve the initial retrieval quality, since the similarity scores are scoped to individual chunks rather than entire documents.
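
The retrieval flow described above can be sketched without any libraries. This is a toy version under stated assumptions: chunks are sentences, similarity is token overlap instead of embedding distance, and the names `parents`, `chunk_index`, and `retrieve_parents` are hypothetical.

```python
import re

parents = {
    "streaming": "Runnables expose a stream method. Streaming yields output chunks as they are produced.",
    "retries": "Runnables can be wrapped with retries. A retry policy re-invokes the runnable on failure.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Index step: split each parent into sentence-sized chunks and remember the parent id.
chunk_index = []
for parent_id, text in parents.items():
    for sentence in text.split(". "):
        chunk_index.append((parent_id, tokenize(sentence)))


def retrieve_parents(query, k=2):
    """Score individual chunks, then return the de-duplicated parents of the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(chunk_index, key=lambda item: len(query_tokens & item[1]), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


results = retrieve_parents("retries with a retry policy on failure")
print(results)
```

Here both top-scoring chunks belong to the same parent, so a single full document is returned. The tradeoff is that returning whole parents can consume far more of the LLM's context window than returning the chunks alone.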

Let's see if this technique is effective in our case.

In [None]:
retriever_factory = langchain_docs.retriever_factories["parent-doc"]

# Indexes the documents with the specified embeddings
parent_doc_retriever = retriever_factory(embeddings)

In [None]:
parent_doc_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, parent_doc_retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    project_name=f"gpt-4o-mini qa-chain parent-doc {run_uid}",
    project_metadata={
        "index_method": "parent-doc",
        "embedding_model": "text-embedding-3-small",
        "llm": "gpt-4o-mini",
    },
    verbose=True,
)

In [None]:
parent_doc_test_run.get_aggregate_feedback()

## HyDE

HyDE (Hypothetical Document Embeddings) refers to the technique of using an LLM
to generate example queries that may be used to retrieve a document. By doing so, the resulting embeddings are automatically "more aligned" with the embeddings generated from the query. This comes with an additional indexing cost, since each document requires an additional call to an LLM while indexing.
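
The idea can be sketched as follows. This is a toy illustration, not the benchmark's implementation: the `hypothetical_questions` are hard-coded stand-ins for what an LLM would generate per document, and matching uses token overlap rather than embeddings.

```python
import re

source_docs = {
    "chroma": "Chroma stores embeddings on disk through the persist_directory argument.",
    "lcel": "LCEL chains are built by piping runnables together.",
}

# Assumption: these questions stand in for LLM output generated once per document at index time.
hypothetical_questions = {
    "chroma": ["where does chroma save my vectors", "how do i persist a vector store"],
    "lcel": ["how do i compose a chain", "what is the pipe operator for"],
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Index the generated questions (not the document text), each pointing back to its document.
question_index = [
    (doc_id, tokenize(question))
    for doc_id, questions in hypothetical_questions.items()
    for question in questions
]


def retrieve_by_question(query):
    """Return the document whose hypothetical question best matches the query."""
    query_tokens = tokenize(query)
    best_doc_id, _ = max(question_index, key=lambda item: len(query_tokens & item[1]))
    return source_docs[best_doc_id]


result = retrieve_by_question("how can i persist my vector store")
print(result)
```

The user's query is phrased like the stored questions, so it matches them more strongly than it would match the declarative document text itself.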

In [None]:
retriever_factory = langchain_docs.retriever_factories["hyde"]

retriever = retriever_factory(embeddings)

In [None]:
hyde_test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=partial(chain_factory, retriever=retriever, llm=llm),
    evaluation=RAG_EVALUATION,
    verbose=True,
    project_name=f"gpt-4o-mini qa-chain HyDE {run_uid}",
    project_metadata={
        "index_method": "HyDE",
        "embedding_model": "text-embedding-3-small",
        "llm": "gpt-4o-mini",
    },
)

In [None]:
hyde_test_run.get_aggregate_feedback()