<div align="center">
    <div><img src="../assets/redis_logo.svg" style="width: 130px"> </div>
    <div style="display: inline-block; text-align: center; margin-bottom: 10px;">
        <span style="font-size: 36px;"><b>Evaluation with RAGAS</b></span>
        <br />
    </div>
    <br />
</div>

# Evaluating RAG

The extent to which you can **evaluate** your system is the extent to which you can **improve** your system. Before going to prod, it is in your best interest to establish a framework for quickly and effectively understanding the quality of your RAG application. In this notebook, we will use the RAGAS framework, as proposed by [this paper](https://arxiv.org/pdf/2309.15217), to evaluate our RAG application.

Before we dive into the theory though, let's set up the necessary environment and a basic RAG application to evaluate.



In [1]:
import os
import warnings
import dotenv
# mute warnings
os.environ["LANGCHAIN_TRACING_V2"] = "false"
warnings.filterwarnings('ignore')
# load env vars from .env file
dotenv.load_dotenv()

dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["ROOT_DIR"] = parent_directory

# point to the locally downloaded sentence-transformer models folder
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"
SCHEMA_PATH = f"{parent_directory}/2_RAG_patterns_with_redis/sec_index.yaml"
SOURCE_DOC = '../resources/filings/AAPL/AAPL-2023-10K.pdf'

# Initialize Redis and create chunks to populate the index

In [2]:
# init Redis connection and index
import os
from redisvl.index import SearchIndex
from redis import Redis

# init Redis connection
# Replace values below with your own if using Redis Cloud instance
REDIS_URL = os.getenv("REDIS_URL")

prefix = 'chunk'
client = Redis.from_url(REDIS_URL)

In [3]:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import numpy as np
import uuid

embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

loader = UnstructuredFileLoader(SOURCE_DOC, mode="single", strategy="fast")

# for use later with parent-doc index
source_doc = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=0)
chunks = loader.load_and_split(text_splitter)


In [4]:
index_objs = [
    {
        "chunk_id": f"{chunk.metadata['source']}-{str(uuid.uuid4())}",
        "source_doc": f"{chunk.metadata['source']}",
        "content": chunk.page_content,
        "doc_type": "10k",
        "text_embedding": np.array(embeddings.embed_query(chunk.page_content)).astype(np.float32).tobytes()
    }
    for chunk in chunks
]

In [5]:
from redisvl.schema import IndexSchema

index_name = 'eval'

schema = IndexSchema.from_dict(
    {
        "index": {
            "name": index_name,
            "prefix": prefix,
            "storage_type": "hash",
        },
        "fields": [
            {"name": "chunk_id", "type": "tag"},
            {"name": "source_doc", "type": "tag"},
            {"name": "doc_type", "type": "tag"},
            {"name": "content", "type": "text"},
            {
                "name": "text_embedding", 
                "type": "vector", 
                "attrs": {"type": "float32", "dims": 384, "distance_metric": "COSINE", "algorithm": "flat"},
            }
        ]
    }
)


# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

In [6]:
keys = index.load(index_objs, id_field="chunk_id")
len(keys)

94

# Create vector store
This is the same process we followed in the previous examples.

In [7]:
from langchain_community.vectorstores import Redis as LangChainRedis


# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "text_embedding",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}, {"name": "source_doc"}, {"name": "doc_type"}, {"name": "chunk_id"}],
    "content_vector_key": "text_embedding" ,   # name of the vector field in langchain
}


rds = LangChainRedis.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
    schema=index_schema,
)

## Test it out!
We can see the vector store is populated and returning results.

In [8]:
rds.similarity_search("What was apples revenue last year?")[0]

Document(page_content='In May 2023, the Company announced a new share repurchase program of up to $90 billion and raised its quarterly dividend from $0.23 to $0.24 per share beginning in May 2023. During 2023, the Company repurchased $76.6 billion of its common stock and paid dividends and dividend equivalents of $15.0 billion.\n\nMacroeconomic Conditions\n\nMacroeconomic conditions, including inﬂation, changes in interest rates, and currency ﬂuctuations, have directly and indirectly impacted, and could in the future materially impact, the Company’s results of operations and ﬁnancial condition.\n\nApple Inc. | 2023 Form 10-K | 20\n\nSegment Operating Performance\n\nThe following table shows net sales by reportable segment for 2023, 2022 and 2021 (dollars in millions):\n\n2023\n\nChange\n\n2022\n\nChange\n\nNet sales by reportable segment:\n\nAmericas Europe Greater China Japan Rest of Asia Paciﬁc\n\nTotal net sales\n\n$\n\n$\n\n162,560 94,294 72,559 24,257 29,615 383,285\n\n(4)% $ (1)%

# Setup RAG

In [9]:
# from langchain_community.llms import Ollama

# # for use in ragas increase context window
# llm = Ollama(model="llama3", num_ctx=4097, temperature=0.1)

from langchain_community.llms import VLLMOpenAI

vllm = VLLMOpenAI(
            openai_api_key=os.environ["HF_MODEL_HUB_TOKEN"], # vllm token key for huggingface through openai like interface
            openai_api_base=os.environ["VLLM_URL"],
            model_name=os.environ["LOCAL_VLLM_MODEL"],
            temperature=0
        )

In [10]:
def get_prompt():
    """Create the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

In [11]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}

qa = RetrievalQA.from_chain_type(
    llm=vllm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold", search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

# Now we have our RAG QA to test out

In [12]:
query = "What was Apple's revenue last year compared to this year??"
res = qa.invoke(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What was Apple's revenue last year compared to this year??",
 'result': "Question: What was Apple's revenue last year compared to this year??\n\nAnswer: According to the provided context from Apple's 2023 Form 10-K, Apple's total net sales for 2022 were $394,328 million, and for 2023, they were $383,285 million. Therefore, Apple's revenue decreased by approximately $11 billion or 3% from 2022 to 2023.\n\nSource: Apple Inc. | 2023 Form 10-K | Pages 20-21",
 'source_documents': [Document(page_content='In May 2023, the Company announced a new share repurchase program of up to $90 billion and raised its quarterly dividend from $0.23 to $0.24 per share beginning in May 2023. During 2023, the Company repurchased $76.6 billion of its common stock and paid dividends and dividend equivalents of $15.0 billion.\n\nMacroeconomic Conditions\n\nMacroeconomic conditions, including inﬂation, changes in interest rates, and currency ﬂuctuations, have directly and indirectly impacted, and coul

# Setup complete!

In the resources, we have included a pre-generated set of test data for evaluation, created with the TestsetGenerator class from the ragas library, to keep the demo fast. The code used to generate this data is provided as well.

In [13]:
import pandas as pd
testset = pd.read_csv("resources/full_testset.csv")

## TestsetGenerator example code for generating a test set

This can be a time-consuming process, so we have gone ahead and pre-generated the test set with the following code. See more on creating test sets [here](https://docs.ragas.io/en/latest/getstarted/testset_generation.html).

Note: while we are using a synthetic test set here, RAGAS can also be used with human-labeled data and [self-created test sets](https://docs.ragas.io/en/stable/howtos/applications/data_preparation.html).

In [14]:
# if problems with nltk data
# import os
# os.environ["NLTK_DATA"] = '/Users/<user>/nltk_data'

if not len(testset):
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from llama_index.llms.ollama import Ollama
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.core import SimpleDirectoryReader

    generator_llm = vllm
    critic_llm = vllm
    embeddings = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    generator = TestsetGenerator.from_llama_index(
        generator_llm=generator_llm,
        critic_llm=critic_llm,
        embeddings=embeddings,
    )

    reader = SimpleDirectoryReader(input_files=[SOURCE_DOC])

    documents = reader.load_data()

    testset = generator.generate_with_llamaindex_docs(
        documents,
        test_size=20,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )

    testset.to_pandas().to_csv("full_testset.csv")

# Begin evaluation
[The ragas library](https://docs.ragas.io/en/stable/index.html) provides helpful classes that abstract away the complexity of creating test sets and evaluating apps that use generative technology. Above we demonstrated how the TestsetGenerator class can be used to create an example dataset. Now we will create a few helper functions to store and aggregate the answers generated and the contexts retrieved by the RAG QA app we defined earlier. This is the data we will pass to the ragas library to calculate our performance metrics.


In [32]:
from datasets import Dataset
from ragas import evaluate
from ragas.run_config import RunConfig

def parse_contexts(source_docs):
    return [doc.page_content for doc in source_docs]

def create_evaluation_dataset(chain, testset):
    res_set = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }

    for _, row in testset.iterrows():
        # call QA chain
        result = chain.invoke(row["question"])

        res_set["question"].append(row["question"])
        res_set["answer"].append(result["result"])

        contexts = parse_contexts(result["source_documents"])
        
        if not len(contexts):
            print(f"no contexts found for question: {row['question']}")
        res_set["contexts"].append(contexts)
        res_set["ground_truth"].append(str(row["ground_truth"]))

    return res_set

def evaluate_chain(chain, testset, test_name, metrics, llm, embeddings):
    eval_dataset = create_evaluation_dataset(chain, testset)

    parsed = Dataset.from_dict(eval_dataset)

    run_config = RunConfig()
    run_config.max_retries = 1


    eval_result = evaluate(
        parsed,
        metrics=metrics,
        run_config=run_config,
        llm=llm,
        embeddings=embeddings
    )

    eval_df = eval_result.to_pandas()
    # store the results of our test for future reference in csv
    eval_df.to_csv(f"{test_name}.csv")
    return eval_df

# First let's evaluate generation metrics
Generation metrics quantify how well the RAG app did at creating answers to the provided questions (i.e. the G in **R**etrieval **A**ugmented **G**eneration). We will calculate the generation metrics **faithfulness** and **answer relevancy** for this example.

The ragas library conveniently abstracts the calculation of these metrics so we don't have to write redundant code, but please review the following definitions to build intuition around what these metrics actually measure.

Note: the following examples are paraphrased from the [ragas docs](https://docs.ragas.io/en/stable/concepts/metrics/index.html)

------

### Faithfulness

An answer to a question can be said to be "faithful" if the **claims** that are made in the answer **can be inferred** from the **context**.

#### Mathematically:

$$
Faithfulness\ score = \frac{Number\ of\ claims\ in\ the\ generated\ answer\ that\ can\ be\ inferred\ from\ the\ given\ context}{Total\ number\ of\ claims\ in\ the\ generated\ answer}
$$

#### Example process:

> Question: Where and when was Einstein born?
> 
> Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
>
> answer: Einstein was born in Germany on 20th March 1879.

Step 1: Use an LLM to break the generated answer into individual statements.
- “Einstein was born in Germany.”
- “Einstein was born on 20th March 1879.”

Step 2: For each statement, use an LLM to verify whether it can be inferred from the context.
- “Einstein was born in Germany.” => yes. 
- “Einstein was born on 20th March 1879.” => no.

Step 3: Plug into the formula.

- Number of claims inferred from context = 1
- Total number of claims = 2
- Faithfulness = 1/2
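
The arithmetic in Step 3 can be sketched in a few lines of plain Python. This is only an illustration of the formula, not how ragas implements the metric: the function name `faithfulness_score` and the `verdicts` list are hypothetical, and the booleans stand in for the yes/no judgements an LLM would produce in Step 2.

```python
def faithfulness_score(verdicts):
    """Fraction of answer claims that can be inferred from the context.

    `verdicts` is a hypothetical list of booleans, one per claim, standing
    in for the LLM's yes/no judgement from Step 2.
    """
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts) / len(verdicts)


# Einstein example: "born in Germany" is supported by the context,
# "born on 20th March 1879" is not.
einstein_verdicts = [True, False]
score = faithfulness_score(einstein_verdicts)
print(score)
```

A score of 1.0 would mean every claim in the answer is backed by the retrieved context.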

### Answer Relevance

An answer can be said to be relevant if it directly addresses the question (intuitively).

#### Example process:

1. Use an LLM to generate "hypothetical" questions to a given answer with the following prompt:

    > Generate a question for the given answer.
    > answer: [answer]

2. Embed the generated "hypothetical" questions as vectors.
3. Calculate the cosine similarity of the hypothetical questions and the original question, sum those similarities, and divide by n.

With data:

> Question: Where is France and what is its capital?
> 
> Answer: France is in western Europe.

Step 1 - use an LLM to generate 'n' candidate questions from the generated answer.

- “In which part of Europe is France located?”
- “What is the geographical location of France within Europe?”
- “Can you identify the region of Europe where France is situated?”

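Step 2 - Calculate the mean cosine similarity between the generated questions and the actual question. Because the answer never mentions the capital, the generated questions only cover France's location, so they match the original two-part question less closely than questions generated from a complete answer would.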
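
The scoring step can be sketched as follows. This is a minimal illustration assuming the questions have already been embedded; the helper names and the two-dimensional toy vectors are made up for the example, whereas in practice the vectors would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and the
    questions generated from the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy vectors (hypothetical): one generated question points the same way as
# the original question, the other is orthogonal to it.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
score = answer_relevancy_score(original, generated)
print(score)
```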

## Now let's implement using our helper functions



In [16]:

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall
)

gen_metrics = [
    answer_relevancy,
    faithfulness,
]

gen_basic_rag_test = evaluate_chain(qa, testset, "generation_basic_rag", gen_metrics, vllm, embeddings)

> Entering new RetrievalQA chain...

> Finished chain.

[chain log repeats for each test question]

Evaluating:   0%|          | 0/38 [00:00<?, ?it/s]

In [17]:
gen_basic_rag_test.describe()

| statistic | answer_relevancy | faithfulness |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.846994 | 0.824561 |
| std | 0.313799 | 0.325006 |
| min | 0.0 | 0.0 |
| 25% | 0.890512 | 0.833333 |
| 50% | 0.997366 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


# Next let's evaluate the retrieval metrics

Retrieval metrics quantify how well the system performed at fetching the best possible context for generation. As before, please review the definitions below to understand what happens under the hood when we execute the evaluation code.

-----

### Context Relevance

"The context is considered relevant to the extent that it exclusively contains information that is needed to answer the question."

#### Example process:

1. Use the following LLM prompt to extract a subset of sentences necessary to answer the question. The context is defined as the formatted search result from the vector database.

    > Please extract relevant sentences from
    > the provided context that can potentially
    > help answer the following `{question}`. If no
    > relevant sentences are found, or if you
    > believe the question cannot be answered
    > from the given context, return the phrase
    > "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences
    > from given `{context}`.

2. Compute the context relevance score = (number of extracted sentences) / (total number of sentences in context)
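
As a rough sketch of step 2, assuming the context has already been split into sentences and the extraction prompt has already been run: the function name is hypothetical, and the example sentences and the choice of which sentence counts as "extracted" are illustrative, not real model output.

```python
def context_relevance_score(extracted, context_sentences):
    """Share of context sentences that the extraction prompt kept.

    `extracted` is either the list of sentences returned by the prompt or
    the literal phrase "Insufficient Information".
    """
    if extracted == "Insufficient Information" or not context_sentences:
        return 0.0
    return len(extracted) / len(context_sentences)


# Illustrative context of three sentences, of which one is judged relevant.
context_sentences = [
    "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches.",
    "The country is also renowned for its wines and sophisticated cuisine.",
    "The vast Palace of Versailles attests to its rich history.",
]
extracted = [context_sentences[0]]
score = context_relevance_score(extracted, context_sentences)
print(round(score, 3))
```

A context made up entirely of needed sentences scores 1.0, and padding it with unrelated sentences pushes the score toward 0.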

Moving from the original paper to the actively maintained ragas library, there are a few more insightful metrics to evaluate. From the library [source](https://docs.ragas.io/en/stable/concepts/metrics/index.html), let's introduce `context precision` and `context recall`.

### Context recall
Context can be said to have high recall if retrieved context aligns with the ground truth answer.

#### Mathematically:

$$
Context\ recall = \frac{Ground\ Truth\ sentences\ that\ can\ be\ attributed\ to\ context}{Total\ number\ of\ sentences\ in\ the\ ground\ truth}
$$

#### Example process:

Data:
> Question: Where is France and what is its capital?
>
> Ground truth answer: France is in Western Europe and its capital is Paris.
>
> Context: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
>
> Note: the ground truth answer can be created by a critic LLM or taken from your own human-labeled dataset.

Step 1 - use an LLM to break the ground truth down into individual statements:
- `France is in Western Europe`
- `Its capital is Paris`

Step 2 - for each ground truth statement, use an LLM to determine if it can be attributed to the context.
- `France is in Western Europe` => yes
- `Its capital is Paris` => no


Step 3 - plug in to formula

context recall = (1 + 0) / 2 = 0.5
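
The same calculation as a small Python sketch. The `(statement, attributed)` pairs are a hypothetical representation of the LLM verdicts from Step 2, and the function name is illustrative rather than part of the ragas API.

```python
def context_recall_score(attributions):
    """Fraction of ground truth statements attributable to the context.

    `attributions` is a list of (statement, attributed) pairs, where
    `attributed` stands in for the LLM's yes/no verdict.
    """
    if not attributions:
        raise ValueError("the ground truth must contain at least one statement")
    supported = sum(1 for _, attributed in attributions if attributed)
    return supported / len(attributions)


france_attributions = [
    ("France is in Western Europe", True),
    ("Its capital is Paris", False),
]
recall = context_recall_score(france_attributions)
print(recall)
```

Note that recall is computed against the ground truth, not the generated answer, so it measures the retriever independently of the LLM's output.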

### Context precision

This metric relates to how chunks are ranked in a response. Ideally the most relevant chunks are at the top.

#### Mathematically:

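$$
Context\ Precision@K = \frac{\sum_{k=1}^{K} \left( Precision@k \times v_k \right)}{Total\ number\ of\ relevant\ items\ in\ the\ top\ K\ results}
$$

$$
Precision@k = \frac{true\ positives@k}{true\ positives@k + false\ positives@k}
$$

where $v_k \in \{0, 1\}$ indicates whether the chunk at rank $k$ is relevant.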

#### Example process:

Data:
> Question: Where is France and what is its capital?
> 
> Ground truth: France is in Western Europe and its capital is Paris.
> 
> Context: [ “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”]

Step 1 - for each chunk use the LLM to check if it's relevant or not to the ground truth answer.

Step 2 - for each chunk in the context, calculate the precision at its rank, defined as `true positives@k / (true positives@k + false positives@k)`:
- `“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”` => not relevant, so precision@1 = 0/1 = 0.
- `“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”` => relevant, so precision@2 = (1) / (1 true positive + 1 false positive) = 0.5.


Step 3 - calculate the overall context precision. Only ranks that hold a relevant chunk contribute to the sum, which is then divided by the number of relevant chunks (1 here): (0 + 0.5) / 1 = 0.5
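
The ranking-sensitive calculation can be sketched as below, assuming the per-chunk relevance verdicts from Step 1 are already available as a list of 0/1 flags in ranked order. The function name is hypothetical and this is not the ragas implementation.

```python
def context_precision_score(relevance_flags):
    """Average of precision@k over the ranks that hold a relevant chunk.

    `relevance_flags` lists 0/1 relevance verdicts in ranked order,
    with the top-ranked chunk first.
    """
    true_positives = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            true_positives += 1
            # precision@k = relevant chunks seen so far / chunks seen so far
            weighted_sum += true_positives / rank
    if true_positives == 0:
        return 0.0
    return weighted_sum / true_positives


# France example: the irrelevant chunk is ranked first, the relevant one second.
worked_example = context_precision_score([0, 1])
# Swapping the order rewards the retriever for ranking the relevant chunk first.
ideal_order = context_precision_score([1, 0])
print(worked_example, ideal_order)
```

The two calls use the same chunks in a different order, which shows that the metric depends on ranking and not only on which chunks were retrieved.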

In [29]:
ret_metrics = [
    context_precision,
    context_recall
]

ret_basic_rag_test = evaluate_chain(qa, testset, "retrieval_basic_rag", ret_metrics, vllm, embeddings)

> Entering new RetrievalQA chain...

> Finished chain.

[chain log repeats for each test question]

Evaluating:   0%|          | 0/38 [00:00<?, ?it/s]

In [30]:
ret_basic_rag_test.describe()

| statistic | context_precision | context_recall |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.862573 | 0.684211 |
| std | 0.286061 | 0.477567 |
| min | 0.0 | 0.0 |
| 25% | 0.958333 | 0.0 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


# Analysis

We can see from the above results that our basic RAG performed reasonably well but left room for improvement, particularly on context recall (mean 0.68). That is fine: we now have a baseline for our RAG's performance, so we can try different techniques and measure whether they actually improve our particular system. This is exactly why it is so important to have an evaluation framework in place.

One technique we could try is a parent document retriever. A parent document retriever attempts to balance two competing objectives within RAG: 1) smaller chunks can lead to better embeddings, since there is less surrounding text to dilute the point (so to speak), and 2) larger chunks retain overall context that could be valuable for answering the question. Parent document retrieval runs the initial query search against the smaller chunks for specificity but returns the larger chunks they belong to for more complete context.

Let's perform an experiment to see if this technique improves our metrics.
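
Before wiring this up with LangChain, the idea can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: chunks are split by word count rather than characters, word overlap stands in for vector similarity, and the function names and sample text are hypothetical.

```python
def split_words(text, size):
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document, parent_size, child_size):
    """Return the parent chunks and a list of (child_text, parent_id) pairs."""
    parents = split_words(document, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query, parents, children):
    """Search the small child chunks, but return the larger parent chunk."""
    _, parent_id = max(children, key=lambda child: overlap(query, child[0]))
    return parents[parent_id]


document = (
    "revenue declined in 2023 services grew strongly dividends "
    "were raised in may buybacks continued all year"
)
parents, children = build_index(document, parent_size=8, child_size=4)
result = retrieve_parent("buybacks year", parents, children)
print(result)
```

The query matches a four-word child chunk, yet the caller receives the eight-word parent that contains it, which is the trade-off the retriever is designed to exploit.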

In [20]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader, UnstructuredFileLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.vectorstores.redis import Redis as LangChainRedis

# We will make a new index for this example defined directly

In [34]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

PARENT_CHUNK_SIZE = 3000
CHILD_CHUNK_SIZE = 400

# This text splitter is used to create the parent documents aka larger chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=PARENT_CHUNK_SIZE)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=CHILD_CHUNK_SIZE)


In [35]:
# embeddings for redis vector store
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Note: it is **critical** that our index includes the `doc_id` field; otherwise the parent document linking will not happen correctly.

In [36]:
from langchain.storage import InMemoryStore

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}, {"name": "doc_id"}],
    "content_vector_key": "chunk_vector" ,   # name of the vector field in langchain
}

vector_store = LangChainRedis(
    REDIS_URL,
    "child_docs",
    embeddings,
    index_schema=index_schema
)

# The storage layer for the parent documents
store = InMemoryStore()

In [37]:
from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [38]:
# Note: we are adding the source documents and the ParentDocumentRetriever will automatically split them into parent and child documents
parent_doc_retriever.add_documents(source_doc)

In [39]:
# test that the retriever works
retrieved_docs = parent_doc_retriever.invoke("apples's revenue 2023")
retrieved_docs[0]

Document(page_content='Fiscal Year Highlights\n\nThe Company’s total net sales were $383.3 billion and net income was $97.0 billion during 2023.\n\nThe Company’s total net sales decreased 3% or $11.0 billion during 2023 compared to 2022. The weakness in foreign currencies relative to the U.S. dollar accounted for more than the entire year-over-year decrease in total net sales, which consisted primarily of lower net sales of Mac and iPhone, partially oﬀset by higher net sales of Services.\n\nThe Company announces new product, service and software oﬀerings at various times during the year. Signiﬁcant announcements during ﬁscal year 2023 included the following:\n\nFirst Quarter 2023:\n\n• • MLS Season Pass, a Major League Soccer subscription streaming service.\n\niPad and iPad Pro; Next-generation Apple TV 4K; and\n\nSecond Quarter 2023:\n\nMacBook Pro 14”, MacBook Pro 16” and Mac mini; and • Second-generation HomePod.\n\nThird Quarter 2023:\n\nMacBook Air 15”, Mac Studio and Mac Pro; •\n

In [40]:
# keep the same but use our new retriever
parent_doc_qa = RetrievalQA.from_chain_type(
    llm=vllm,
    chain_type="stuff",
    retriever=parent_doc_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

## Like before let's evaluate the generation metrics first

Note: it is often practical to not calculate all the metrics at once for rate limiting reasons. 

In [41]:
gen_metrics = [
    answer_relevancy,
    faithfulness,
]

gen_parent_doc_test = evaluate_chain(parent_doc_qa, testset, "generation_parent_doc", gen_metrics, vllm, embeddings)

> Entering new RetrievalQA chain...

> Finished chain.

[chain log repeats for each test question]

Evaluating:   0%|          | 0/38 [00:00<?, ?it/s]

In [42]:
gen_parent_doc_test.describe()

| statistic | answer_relevancy | faithfulness |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.834437 | 0.761404 |
| std | 0.371819 | 0.329845 |
| min | 0.0 | 0.0 |
| 25% | 0.980172 | 0.583333 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


## And the same for the retrieval metrics

In [45]:
ret_metrics = [
    context_precision,
    context_recall
]

ret_parent_doc_test = evaluate_chain(parent_doc_qa, testset, "retrieval_parent_doc", ret_metrics, vllm, embeddings)

> Entering new RetrievalQA chain...

> Finished chain.

[chain log repeats for each test question]

Evaluating:   0%|          | 0/38 [00:00<?, ?it/s]

Failed to parse output. Returning None.


In [None]:
ret_parent_doc_test.describe()

| statistic | context_precision | context_recall |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.960526 | 0.701754 |
| std | 0.098999 | 0.425205 |
| min | 0.638889 | 0.0 |
| 25% | 1.0 | 0.416667 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


# Analysis

In this case, the parent document retriever had a slightly negative effect on the generation metrics: mean answer relevancy moved from 0.847 to 0.834 and mean faithfulness from 0.825 to 0.761, possibly because the larger context made it harder for the model to produce a focused answer. It had a positive effect on the retrieval metrics, especially context precision (0.863 to 0.961, with context recall moving from 0.684 to 0.702), which suggests that matching the query against smaller chunks helped rank the relevant context first. Since generation quality did not improve, context ranking does not appear to have been the limiting factor for the baseline in this test. With only 19 test questions, more extensive testing would be needed to draw firm conclusions, but this example shows how to compare options in order to prioritize strategies for a given application.
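
The comparison can be made explicit with a short sketch that computes the change in each mean score between the two runs. The numbers are the `mean` rows from the `describe()` outputs above; the dictionary and variable names are just for illustration.

```python
# Mean scores copied from the describe() outputs of the two experiments.
baseline = {
    "answer_relevancy": 0.846994,
    "faithfulness": 0.824561,
    "context_precision": 0.862573,
    "context_recall": 0.684211,
}
parent_doc = {
    "answer_relevancy": 0.834437,
    "faithfulness": 0.761404,
    "context_precision": 0.960526,
    "context_recall": 0.701754,
}

# Positive delta = the parent document retriever scored higher on average.
deltas = {
    metric: round(parent_doc[metric] - baseline[metric], 6)
    for metric in baseline
}

for metric, delta in deltas.items():
    print(f"{metric:>18}: {delta:+.3f}")
```

With standard deviations around 0.3 to 0.4 on 19 samples, differences of this size should be read as directional hints rather than statistically firm results.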

# Review


In this notebook we covered:
- why it's important to have an evaluation framework
- the basic theory of RAGAS
- how to calculate and generate faithfulness, answer_relevancy, context_precision, and context_recall
- code to evaluate two different RAG chains to monitor how using a different retrieval strategy affects performance


# Next steps: end-to-end evaluation

As your pipeline matures and human-labeled ground truth data is created, the following metrics can be added for increased rigor. These additional metrics can be implemented in the same way as the ones showcased above.


## Answer correctness

A weighted average of semantic and factual similarity where weights can be passed as a parameter.

## Answer semantic similarity

Measure distance between ground truth and the generated answer.

#### Example process:
- vectorize the ground truth answer and the generated answer
- compute the cosine similarity.

## Answer factual similarity

Mathematically:

$$
F1\ Score = \frac{TP}{TP + 0.5(FP + FN)}
$$

Where:

- TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
- FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
- FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.

#### Example process:

data:
> Ground truth: Einstein was born in 1879 in Germany.
>
> Generated Answer: Einstein was born in Spain in 1879.

TP: [Einstein was born in 1879]

FP: [Einstein was born in Spain]

FN: [Einstein was born in Germany]

F1 = 1 / (1 + 0.5(1 + 1)) = 1/2
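
To tie the three definitions together, here is a minimal sketch of the F1 calculation and the weighted average used for answer correctness. The function names are hypothetical, and both the semantic similarity value (0.9) and the weights (0.75, 0.25) are made-up inputs chosen only to show the arithmetic.

```python
def factual_f1(tp, fp, fn):
    """F1 score over facts: TP / (TP + 0.5 * (FP + FN))."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(factual, semantic, weights):
    """Weighted average of factual and semantic similarity.

    `weights` is a (factual_weight, semantic_weight) pair.
    """
    factual_weight, semantic_weight = weights
    total = factual_weight + semantic_weight
    return (factual_weight * factual + semantic_weight * semantic) / total


# Einstein example: 1 true positive, 1 false positive, 1 false negative.
f1 = factual_f1(tp=1, fp=1, fn=1)

# Hypothetical semantic similarity and weights, for illustration only.
correctness = answer_correctness(f1, semantic=0.9, weights=(0.75, 0.25))
print(f1, round(correctness, 3))
```

Raising the semantic weight makes the metric more forgiving of answers that are phrased similarly to the ground truth but contain factual slips, so the weights should reflect what matters for the application.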

