# Evaluating RAG

The extent to which you can **evaluate** your system is the extent to which you can **improve** your system. Before going to production, it is in your best interest to establish a framework for quickly and effectively measuring the quality of your RAG application. In this notebook, we will use the RAGAS framework, as proposed by [this paper](https://arxiv.org/pdf/2309.15217), to evaluate the RAG application developed in the previous examples.

There is no substitute for reading the paper, but the main metrics we will work with are summarized below. Note: many more metrics can be used depending on the use case, but these are the main ones covered in the paper, so we will start there.

# Quality metric breakdown

The 3 quality metrics in the RAGAS framework are: **faithfulness**, **answer relevance**, and **context relevance**. Let's take a moment to define each and understand how we can arrive at their values.

## Faithfulness

An answer to a question can be said to be "faithful" if the **claims** that are made in the answer **can be inferred** from the **context**.

The process for quantifying this score is as follows:

1. Use the following prompt with an LLM to generate shorter, more focused statements from the given question and answer.

    > Given a question and answer, create one
    > or more statements from each sentence
    > in the given answer.
    > question: [question]
    > answer: [answer]

2. For each generated statement, verify if it can be inferred from the context with the following prompt.

    > Consider the given context and following
    > statements, then determine whether they
    > are supported by the information present
    > in the context. Provide a brief explanation for each statement before arriving
    > at the verdict (Yes/No). Provide a final
    > verdict for each statement in order at the
    > end in the given format. Do not deviate
    > from the specified format.
    > statement: [statement 1]
    > ...
    > statement: [statement n]

3. The final score can then be calculated as: Faithfulness = (number of supported statements) / (total number of statements)
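
As a minimal sketch of step 3, assume the per-statement verdicts from step 2 have already been parsed into booleans. The `faithfulness_score` helper and the example verdicts are hypothetical and are not part of the RAGAS library:

```python
def faithfulness_score(verdicts):
    """Fraction of statements judged as supported by the context.

    `verdicts` is a list of booleans, one per generated statement
    (True means the LLM answered "Yes" for that statement).
    """
    if not verdicts:
        # No statements could be extracted, so the score is undefined.
        return None
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer that was split into 4 statements.
example_verdicts = [True, True, False, True]
print(faithfulness_score(example_verdicts))  # 0.75
```

Returning `None` for an empty list is a design choice of this sketch: an answer with no extractable statements has no meaningful ratio.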

## Answer Relevance

Intuitively, an answer can be said to be relevant if it directly addresses the question.

The process for quantifying this score is:

1. Use an LLM to generate "hypothetical" questions to a given answer with the following prompt:

    > Generate a question for the given answer.
    > answer: [answer]

2. Embed the generated "hypothetical" questions as vectors.
3. Calculate the cosine similarity between each hypothetical question and the original question, sum those similarities, and divide by n, the number of generated questions.

Expressed computationally: `Answer Relevance = sum(cos_sim(q, q_i) for q_i in generated_questions) / n`
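
The averaging step can be sketched in plain Python. This assumes the original question and the generated questions have already been embedded; the helper names and the toy 2-d vectors below are illustrative only and do not come from RAGAS:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question embedding
    and the embeddings of the n LLM-generated questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-d "embeddings" (hypothetical values, for illustration only).
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(q, generated))  # 0.5
```

In practice the vectors would come from the same embedding model for both the original and the generated questions, so that the similarities are comparable.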

## Context Relevance

"The context is considered relevant to the extent that it exclusively contains information that is needed to answer the question."

The process:

1. Use the following LLM prompt to extract a subset of sentences necessary to answer the question. The context is defined as the formatted search result from the vector database.

    > Please extract relevant sentences from
    > the provided context that can potentially
    > help answer the following `{question}`. If no
    > relevant sentences are found, or if you
    > believe the question cannot be answered
    > from the given context, return the phrase
    > "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences
    > from given `{context}`.

2. Compute the context relevance score = (number of extracted sentences) / (total number of sentences in context)
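
A rough sketch of step 2 follows. It assumes the extracted sentences from step 1 are already available as a list, and it uses a naive regex sentence splitter as a stand-in for whatever segmentation a real implementation uses. The function names and the toy context are hypothetical:

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def context_relevance(context, extracted_sentences):
    """Ratio of extracted sentences to all sentences in the context."""
    total = len(split_sentences(context))
    if total == 0:
        return 0.0
    return len(extracted_sentences) / total


# Toy context with four sentences, one of which is needed for the answer.
context = (
    "Revenue grew this year. "
    "The office has a new cafe. "
    "Margins were flat. "
    "The meeting ran long."
)
extracted = ["Revenue grew this year."]
print(context_relevance(context, extracted))  # 0.25
```

If the LLM returns "Insufficient Information", the extracted list would be empty and this sketch yields a score of 0.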

# Let's start coding!

If you just finished the other examples this may already be done for you.


# Initialize Redis and create chunks to populate the index

In [1]:
import os
import warnings
import dotenv
# mute warnings
warnings.filterwarnings('ignore')
# load env vars from .env file
dotenv.load_dotenv()
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["ROOT_DIR"] = parent_directory
REDIS_URL = os.getenv("REDIS_URL")
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/boa-financial-rag-workshop/3_evaluation
/Users/rouzbeh.farahmand/PycharmProjects/boa-financial-rag-workshop


In [2]:
from redisvl.index import SearchIndex
from redisvl.schema import IndexSchema
from redis import Redis

index_name = 'langchain'
prefix = 'chunk'
schema = IndexSchema.from_yaml(f'{parent_directory}/helpers/sec_index.yaml')
client = Redis.from_url(REDIS_URL)

# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

08:09:47 redisvl.index.index INFO   Index already exists, overwriting.


In [2]:
# configure env
import json

os.environ["ROOT_DIR"] = parent_directory
# set the folder for locally downloaded sentence transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

In [4]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
from helpers.ingestion import get_sec_data
from helpers.ingestion import redis_bulk_upload

embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))
sec_data = get_sec_data()

chunks = redis_bulk_upload(sec_data, index, embeddings, tickers=['AAPL']) #, chunk_size=2500) 

 ✅ Loaded doc info for  110 tickers...
✅ Loaded 108 10K chunks for ticker=AAPL from AAPL-2021-10K.pdf
✅ Loaded 94 10K chunks for ticker=AAPL from AAPL-2023-10K.pdf
✅ Loaded 103 10K chunks for ticker=AAPL from AAPL-2022-10K.pdf
✅ Loaded 27 earning_call chunks for ticker=AAPL from 2018-May-01-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2019-Oct-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2016-Jan-26-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2020-Jul-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2017-Aug-01-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2020-Jan-28-AAPL.txt
✅ Loaded 34 earning_call chunks for ticker=AAPL from 2016-Apr-26-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2017-Jan-31-AAPL.txt
✅ Loaded 28 earning_call chunks for ticker=AAPL from 2019-Apr-30-AAPL.txt
✅ Loaded 26 earning_call chunks for ticker=AAPL from 2017-Nov-02-AAPL.txt
✅ Loaded 31 earning_call chunks f

In [5]:
flattened_chunks = [item for sublist in chunks for item in sublist]
len(flattened_chunks)

TypeError: 'NoneType' object is not iterable

# Populate index and create vector store
This is the same as what we did in the previous examples.

In [7]:
from langchain_community.vectorstores import Redis as LangChainRedis
from helpers.utils import create_langchain_schemas_from_redis_schema

index_name = 'langchain'

vec_schema, main_schema = create_langchain_schemas_from_redis_schema(f'{parent_directory}/helpers/sec_index.yaml')

rds = LangChainRedis.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
    schema=main_schema
)

## Test it out!
We can see the vector store is populated and returning results.

In [8]:
rds.similarity_search("What was apples revenue last year?")[0]

Document(page_content="Earlier this month, released macOS Catalina with all new entertainment apps, innovative Sidecar feature that uses iPad to expand Mac workspace and new accessibility tools that enable users to control their Mac entirely with their voice. 1. Catalina brings Apple Arcade experience to Mac. 1. Already seeing some third-party developers bring their iPad apps to Mac App Store with Mac Catalyst, including Twitter, Post-it and more. 4. Launching newly redesigned Mac Pro this fall, which Co. is manufacturing in Austin, Texas. 7. Others: 1. In FY19, crossed $100b in revenue in US for first time. 2. Introduce new services from Apple Card to Apple TV+ and generated over $46b in total Services revenue, setting new yearly Services records in all five geographic segments and driving Services business to size of Fortune 70 co. 3. Delivered new hardware in all device categories. 4. Wearables business showed explosive growth and generated more annual revenue than two-thirds of com

# Setup RAG

In [11]:
from langchain_community.llms import Ollama

# we will use llama3 as our local llm for this use case
llm = Ollama(model="llama3")

In [13]:
def get_prompt():
    """Create the prompt template used by the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

In [14]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}
    

# options 
# search_type="similarity_distance_threshold",
# search_kwargs={"distance_threshold":0.8, 'include_metadata': True}

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

# Now we have our RAG QA to test out

In [34]:
query = "What was Apple's revenue last year compared to this year??"
res=qa(query)
res

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.

> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What was Apple's revenue last year compared to this year??",
 'result': 'Question: What was Apple\'s revenue last year compared to this year?\nAnswer: According to the context, in fiscal year \'18, Apple\'s revenue grew by $36.4 billion. In addition, Q4 revenue was up 29% over last year, reaching a new September quarter record.\nSource: The answer is based on the following lines from the text:\n"In fiscal year \'18, Apple\'s revenue grew by $36.4 billion."\n"Q4 revenue was up 29% over last year, reaching a new September quarter record."',
 'source_documents': [Document(page_content="Thank you, Nancy. Good afternoon, everyone, and thanks for joining us. I just got back from Brooklyn, where we marked our fourth major launch at the end of the year. In addition to being a great time, it put an exclamation point at the end of a remarkable fiscal 2018. This year, we shipped our 2 billionth iOS device, celebrated the 10th anniversary of the App Store and achieved the strongest reve

# Setup complete!
Now let's generate some test questions to evaluate the answering abilities of the RAG QA using the metrics we introduced at the beginning. To do this we can use the LLM to come up with some potential questions.

In [128]:
with open("evaluation/questions.json", "r") as f:
    questions = json.load(f)

questions

["What is Apple's total revenue for 2023 compared to the previous year?",
 'What percentage increase in Services revenue did Apple report in 2023?',
 "How much has Apple's gross margin increased/decreased over the past three years?",
 "What was Apple's operating cash flow for 2023, and how does it compare to 2022?",
 'In what sectors did Apple see significant growth in its hardware sales (e.g., Mac, iPad, etc.)?',
 "By what percentage did Apple's iPhone revenue increase or decrease in 2023 compared to the previous year?",
 "What was Apple's research and development expense for 2023, and how does it compare to 2022?",
 "How has Apple's capital expenditures changed over the past five years?",
 'In what regions did Apple see significant growth in its sales (e.g., Asia, Americas, etc.)?',
 "By what percentage did Apple's China revenue increase or decrease in 2023 compared to the previous year?",
 "What was Apple's effective tax rate for 2023, and how does it compare to the previous year?",

In [None]:
if not questions:
    prompt = """
        You are a helpful question generating bot.
        Generate 15 questions you might ask about Apple's financial performance from its 2023 annual report, earnings calls,
        and other financial documents. Return the response without any additional text as a json object of the form
        {"questions": [question1, question2, ..., question15]}
    """

    # parse the JSON response and keep only the list of questions
    questions = json.loads(llm.generate([prompt]).generations[0][0].text)["questions"]

questions

# Utilize TestSetGenerator from ragas to generate test questions

Note: this can be a time-consuming process, so we have pregenerated the test set with the following code.

In [None]:
%pip install -q llama-index-embeddings-huggingface llama-index-llms-ollama llama-index-embeddings-instructor

In [34]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
# alias the LlamaIndex Ollama class so it does not shadow the LangChain `Ollama`
# class imported earlier, and keep the earlier `embeddings` object intact
from llama_index.llms.ollama import Ollama as LlamaIndexOllama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

generator_llm = LlamaIndexOllama(model="llama3")
critic_llm = LlamaIndexOllama(model="llama3")
generator_embeddings = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=generator_embeddings,
)

In [35]:
from llama_index.core import SimpleDirectoryReader

source = "../resources/10k/aapl-10k-2023.pdf"

# load the source PDF into LlamaIndex documents
reader = SimpleDirectoryReader(input_files=[source])
documents = reader.load_data()

In [None]:
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents[:30],
    test_size=15,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

testset.to_pandas().to_csv("gen_testset.csv")

In [41]:
# for simplicity, here are the questions produced by the previous steps
testset_questions = [
    "What services does Apple offer through its Payment Services, and how do these services contribute to the company's overall sales?",
    "What is the estimated maximum one-day loss in fair value of the Company's foreign currency derivative positions, according to the VAR model as of September 24, 2022?",
    "What drives Apple Inc.'s competitive edge & how does it impact results & financials?",
    "What are potential risks for Apple if it doesn't meet regulatory expectations or faces antitrust scrutiny, given its ESG investments & reliance on 3rd party data?",
    "What factors contributed to the 7% boost in iPhone sales and how did this growth align with Apple's recent product launches and industry developments?",
    "What were the primary factors driving the increase in Americas' net sales in 2022 compared to 2021?",
    "What impact do new product and service introductions have on Apple's net sales, cost of sales, and operating expenses?",
    "What are the main characteristics of the Company's manufacturing purchase obligations as of September 24, 2022?",
    "What is the trading symbol for Apple Inc.'s common stock?",
    "What is the estimated maximum one-day loss in fair value of the Company's foreign currency derivative positions as of September 24, 2022?",
    "What are the potential consequences if Apple Inc. fails to obtain licenses for third-party intellectual property or uses such intellectual property on unreasonable terms?",
    "What role does the company culture play in its ability to recruit and retain highly skilled employees, and how might it impact the business if not managed effectively?",
    "What drove iPhone net sales growth in 2022, considering Q4 2021 saw new models released?",
    "What OS choices support Apple's diverse product lineup?",
    "What drives changes in Apple's effective tax rates, and how do factors like earnings mix, statutory tax rates, and tax laws influence these fluctuations?",
    "What could go awry for Apple's top-grossing item, leading to a hit on Q2 earnings?",
    "What obstacles might affect the Company's DRM & security solution progress, potentially straining ties with tech partners?",
    "How might Apple's performance be impacted if economic headwinds intensify, driving up competition & eroding consumer trust?"
]

# Helper function for creating test dataset

In the following code we take a list of questions and a QA retrieval chain as input. We call the chain and store the answer returned along with the context (aka source documents) to be used as the essential data for our evaluation.


In [42]:

# define reusable helper function for evaluating our test set against different chains

from datasets import Dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_relevancy,
)

from ragas import evaluate

def parse_contexts(source_docs):
    return [doc.page_content for doc in source_docs]

def create_evaluation_dataset(chain, questions):
    res_set = {
        "question": [],
        "answer": [],
        "contexts": [],
    }

    for question in questions:
        # call QA chain
        result = chain(question)

        res_set["question"].append(question)
        res_set["answer"].append(result["result"])
        res_set["contexts"].append(parse_contexts(result["source_documents"]))
    return Dataset.from_dict(res_set)

def evaluate_chain(chain, questions, test_name):
    eval_dataset = create_evaluation_dataset(chain, questions)

    eval_result = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_relevancy
        ]
    )

    eval_df = eval_result.to_pandas()
    # store the results of our test for future reference in csv
    eval_df.to_csv(f"{test_name}.csv")
    return eval_df

In [44]:
import getpass
# from llama_index.llms.ollama import Ollama
# from langchain_community.llms import Ollama

# by default ragas evaluation uses OpenAI
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

basic_rag_test = evaluate_chain(qa, testset_questions, "basic_rag_ragas_testset_20")


> Entering new RetrievalQA chain...

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.

> Finished chain.

Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

In [45]:
basic_rag_test.describe()

| | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|
| count | 18.0 | 18.0 | 18.0 |
| mean | 0.774471 | 0.941588 | 0.026878 |
| std | 0.372733 | 0.235102 | 0.027302 |
| min | 0.0 | 0.0 | 0.005495 |
| 25% | 0.589286 | 0.999986 | 0.01087 |
| 50% | 1.0 | 1.0 | 0.01411 |
| 75% | 1.0 | 1.0 | 0.032585 |
| max | 1.0 | 1.0 | 0.1 |


# Analysis

We can see from the above results that our basic RAG has room to improve: mean faithfulness is about 0.77 and mean answer relevancy about 0.94, but mean context relevancy is only about 0.03, which suggests that most of the retrieved text is not needed to answer the question. This is okay: now that we have a baseline for the performance of our RAG, we can try different techniques to improve our results. This is why an evaluation framework matters so much. It lets us experiment with different techniques and see what actually improves our particular system.

One technique we could try is a parent document retriever. A parent document retriever attempts to balance two competing objectives within RAG: 1) smaller chunks can lead to better embeddings, since there is less surrounding text to dilute the main point, and 2) larger chunks help retain overall context that could be valuable for answering the question. Parent document retrieval runs the initial query search on the smaller chunks for specificity, but returns the larger chunks for more complete context.
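
To make the idea concrete, here is a toy sketch of the mechanism in plain Python. It is not how LangChain implements `ParentDocumentRetriever`: all function names are hypothetical, chunks are split by fixed character counts, and a simple word-overlap score stands in for embedding similarity.

```python
def split_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_index(document, parent_size, child_size):
    """Return (parents, children) where each child remembers its parent's id."""
    parents = dict(enumerate(split_text(document, parent_size)))
    children = [
        (parent_id, child)
        for parent_id, parent in parents.items()
        for child in split_text(parent, child_size)
    ]
    return parents, children


def overlap_score(query, text):
    """Toy relevance score: how many query words occur as substrings of the text.
    A real system would compare embedding vectors instead."""
    return sum(word in text.lower() for word in query.lower().split())


def retrieve_parent(query, parents, children):
    """Search over the small child chunks, but return the larger parent chunk."""
    best_parent_id, _ = max(children, key=lambda pc: overlap_score(query, pc[1]))
    return parents[best_parent_id]


# Toy document: filler text around one sentence of interest.
document = (
    "alpha beta gamma delta. " * 2
    + "net sales rose in the americas. "
    + "epsilon zeta eta theta. " * 2
)
parents, children = build_index(document, parent_size=60, child_size=20)
result = retrieve_parent("net sales", parents, children)
print(result)
```

The match is found in a 20-character child chunk, yet the caller receives the 60-character parent it belongs to, which is the trade-off described above.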

Let's perform an experiment to see if this technique improves our metrics.

In [46]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader, UnstructuredFileLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.vectorstores.redis import Redis as LangChainRedis

# We will make a new index for this example defined directly

In [47]:
from langchain.document_loaders import UnstructuredFileLoader, TextLoader


# load our multi modal docs
source_docs = []

for doc in sec_data["AAPL"]["10K_files"]:
    loader = UnstructuredFileLoader(
        doc, mode="single", strategy="fast"
    )

    source_docs.extend(loader.load())

for doc in sec_data["AAPL"]["transcript_files"]:
    loader = TextLoader(doc)

    source_docs.extend(loader.load())



In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

PARENT_CHUNK_SIZE = 5000
CHILD_CHUNK_SIZE = 400

# This text splitter is used to create the parent documents aka larger chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=PARENT_CHUNK_SIZE)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=CHILD_CHUNK_SIZE)


In [52]:
# embeddings for redis vector store
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Note: it is **critical** that our index includes the `doc_id` field; otherwise the parent document linking will not happen correctly.

In [53]:
from langchain.storage import InMemoryStore

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}, {"name": "doc_id"}],
    "content_vector_key": "chunk_vector" ,   # name of the vector field in langchain
}

vector_store = LangChainRedis(
    REDIS_URL,
    "child_docs",
    embeddings,
    index_schema=index_schema
)

# The storage layer for the parent documents
store = InMemoryStore()

In [54]:
from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [55]:
# Note: we are adding the source documents and the ParentDocumentRetriever will automatically split them into parent and child documents
parent_doc_retriever.add_documents(source_docs)

In [56]:
# test that the retriever works
retrieved_docs = parent_doc_retriever.invoke("apples's revenue 2023")
retrieved_docs[0]

Document(page_content='The Company evaluates the performance of its reportable segments based on net sales and operating income. Net sales for geographic segments are generally based on the location of customers and sales through the Company’s retail stores located in those geographic locations. Operating income for each segment includes net sales to third parties, related cost of sales and operating expenses directly attributable to the segment. Advertising expenses are generally included in the geographic segment in which the expenditures are incurred. Operating income for each segment excludes other income and expense and certain expenses managed outside the reportable segments. Costs excluded from segment operating income include various corporate expenses such as research and development (“R&D”), corporate marketing expenses, certain share-based compensation expenses, income taxes, various nonrecurring charges and other separately managed general and administrative costs. The Comp

In [57]:
# keep the same but use our new retriever
parent_doc_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=parent_doc_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

In [58]:
parent_doc_test = evaluate_chain(parent_doc_qa, testset_questions, "parent_doc_ragas_testset")


> Entering new RetrievalQA chain...

> Finished chain.

Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

No statements were generated from the answer.


In [59]:
parent_doc_test.describe()

| | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|
| count | 17.0 | 18.0 | 18.0 |
| mean | 0.847594 | 0.665775 | 0.033181 |
| std | 0.297799 | 0.484437 | 0.046002 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.909091 | 0.0 | 0.01119 |
| 50% | 1.0 | 0.999999 | 0.013699 |
| 75% | 1.0 | 1.0 | 0.029096 |
| max | 1.0 | 1.0 | 0.166667 |


# Analysis

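It appears that implementing the parent doc retriever did not meaningfully improve the performance of our RAG application. The results are mixed: mean faithfulness rose from about 0.77 to 0.85 and mean context relevancy from about 0.027 to 0.033, but mean answer relevancy fell from about 0.94 to 0.67. Faithfulness was also scored on only 17 of the 18 answers, likely because no statements could be generated from one of them. This is okay, because now we know that we need to look into alternative techniques to improve our results. This series has presented many different ways to enhance a RAG application; this notebook shows how we might measure which one is best for our particular use case.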
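
One simple way to compare two runs is to look at the change in each metric's mean. The sketch below copies the mean values from the two `describe()` outputs above; the `compare_runs` helper is hypothetical and is not part of RAGAS:

```python
# Mean scores copied from the two describe() outputs above.
baseline = {
    "faithfulness": 0.774471,
    "answer_relevancy": 0.941588,
    "context_relevancy": 0.026878,
}
parent_doc = {
    "faithfulness": 0.847594,
    "answer_relevancy": 0.665775,
    "context_relevancy": 0.033181,
}


def compare_runs(before, after):
    """Return the change in each metric's mean (after - before)."""
    return {metric: round(after[metric] - before[metric], 6) for metric in before}


deltas = compare_runs(baseline, parent_doc)
for metric, delta in deltas.items():
    direction = "improved" if delta > 0 else "regressed"
    print(f"{metric}: {delta:+.6f} ({direction})")
```

With only 18 questions and large standard deviations, small differences in the means should be read with caution; a larger test set would give a more reliable comparison.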

# Conclusion


As a review, in this notebook we covered:
- why it's important to have an evaluation framework
- the basic theory of RAGAS
- how to interpret and generate faithfulness, answer_relevancy, and context_relevancy
- code to evaluate two different RAG chains and measure whether a parent document retriever improves our results
