<div align="center">
    <div><img src="../assets/redis_logo.svg" style="width: 130px"> </div>
    <div style="display: inline-block; text-align: center; margin-bottom: 10px;">
        <span style="font-size: 36px;"><b>Evaluation with RAGAS</b></span>
        <br />
    </div>
    <br />
</div>

# Evaluating RAG

The extent to which you can **evaluate** your system is the extent to which you can **improve** your system. Before going to prod, it is in your best interest to establish a framework for quickly and effectively understanding the quality of your RAG application. In this notebook, we will use the RAGAS framework, as proposed by [this paper](https://arxiv.org/pdf/2309.15217), to evaluate our RAG application.

Before we dive into the theory, though, let's set up the necessary environment and a basic RAG application to evaluate.



In [None]:
!pip install -r requirements.txt


In [1]:
import os
import warnings
import dotenv
# mute warnings
os.environ["LANGCHAIN_TRACING_V2"] = "false"
warnings.filterwarnings('ignore')
# load env vars from .env file
dotenv.load_dotenv()

dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["ROOT_DIR"] = parent_directory

# point the Hugging Face cache at the locally downloaded sentence-transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"
SCHEMA_PATH = f"{parent_directory}/2_RAG_patterns_with_redis/sec_index.yaml"
SOURCE_DOC = '../resources/filings/AAPL/AAPL-2023-10K.pdf'

ModuleNotFoundError: No module named 'dotenv'

# Initialize Redis and create chunks to populate the index

In [99]:
# init Redis connection and index
import os
from redisvl.index import SearchIndex
from redis import Redis

# init Redis connection
# Replace values below with your own if using Redis Cloud instance
REDIS_URL = os.getenv("REDIS_URL")

prefix = 'chunk'
client = Redis.from_url(REDIS_URL)

In [100]:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import numpy as np
import uuid

embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

loader = UnstructuredFileLoader(SOURCE_DOC, mode="single", strategy="fast")

# for use later with parent-doc index
source_doc = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = loader.load_and_split(text_splitter)


In [101]:
index_objs = [
    {
        "chunk_id": f"{chunk.metadata['source']}-{str(uuid.uuid4())}",
        "source_doc": f"{chunk.metadata['source']}",
        "content": chunk.page_content,
        "doc_type": "10k",
        "text_embedding": np.array(embeddings.embed_query(chunk.page_content)).astype(np.float32).tobytes()
    }
    for chunk in chunks
]

In [102]:
from redisvl.schema import IndexSchema

index_name = 'eval'

schema = IndexSchema.from_dict(
    {
        "index": {
            "name": index_name,
            "prefix": prefix,
            "storage_type": "hash",
        },
        "fields": [
            {"name": "chunk_id", "type": "tag"},
            {"name": "source_doc", "type": "tag"},
            {"name": "doc_type", "type": "tag"},
            {"name": "content", "type": "text"},
            {
                "name": "text_embedding", 
                "type": "vector", 
                "attrs": {"type": "float32", "dims": 384, "distance_metric": "COSINE", "algorithm": "flat"},
            }
        ]
    }
)


# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

In [103]:
keys = index.load(index_objs, id_field="chunk_id")
len(keys)

263

# Create vector store
This is the same process we followed in the previous examples.

In [104]:
from langchain_community.vectorstores import Redis as LangChainRedis


# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "text_embedding",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}, {"name": "source_doc"}, {"name": "doc_type"}, {"name": "chunk_id"}],
    "content_vector_key": "text_embedding" ,   # name of the vector field in langchain
}


rds = LangChainRedis.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
    schema=index_schema,
)

## Test it out!
We can see the vector store is populated and returning results.

In [105]:
rds.similarity_search("What was apples revenue last year?")[0]

Document(page_content='Apple Inc. | 2023 Form 10-K | 27\n\nPage\n\n28\n\n29 30\n\n31\n\n32 33 49\n\nApple Inc.\n\nCONSOLIDATED STATEMENTS OF OPERATIONS (In millions, except number of shares, which are reﬂected in thousands, and per-share amounts)\n\nYears ended\n\nSeptember 30, 2023\n\nSeptember 24, 2022\n\nSeptember 25, 2021\n\nNet sales: Products Services\n\nTotal net sales\n\n$\n\n298,085 $ 85,200 383,285\n\n316,199 $ 78,129 394,328\n\n297,392 68,425 365,817\n\nCost of sales: Products Services\n\nTotal cost of sales Gross margin\n\n189,282 24,855 214,137 169,148\n\n201,471 22,075 223,546 170,782\n\n192,266 20,715 212,981 152,836\n\nOperating expenses:\n\nResearch and development Selling, general and administrative\n\nTotal operating expenses\n\n29,915 24,932 54,847\n\n26,251 25,094 51,345\n\n21,914 21,973 43,887\n\nOperating income Other income/(expense), net Income before provision for income taxes Provision for income taxes Net income\n\n$\n\n114,301 (565) 113,736 16,741 96,995 $\

# Setup RAG

Initialize the LLM. Examples are shown for Ollama, OpenAI, and vLLM.

In [106]:
from langchain_community.llms import Ollama, VLLMOpenAI
from langchain_openai import ChatOpenAI



# for Ollama use => increase context window
# llm = Ollama(model="llama3", num_ctx=4097, temperature=0.1)

llm = VLLMOpenAI(
            openai_api_key=os.environ["HF_MODEL_HUB_TOKEN"], # vllm token key for huggingface through openai like interface
            openai_api_base=os.environ["VLLM_URL"],
            model_name=os.environ["LOCAL_VLLM_MODEL"],
            temperature=0
        )

# llm = ChatOpenAI(
#     openai_api_key=os.environ["OPENAI_API_KEY"],
#     model="gpt-3.5-turbo-16k",
#     max_tokens=None
# )


In [58]:
def get_prompt():
    """Build the prompt template used by the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

In [107]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold", search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

# Now we have our RAG QA to test out

In [108]:
query = "What was Apple's revenue last year compared to this year??"
res=qa(query)
res



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What was Apple's revenue last year compared to this year??",
 'result': "Answer: Apple's revenue last year was $394.3 billion, while this year it was $383.3 billion.\nSource: Apple Inc. | 2023 Form 10-K | 27",
 'source_documents': [Document(page_content='Apple Inc. | 2023 Form 10-K | 27\n\nPage\n\n28\n\n29 30\n\n31\n\n32 33 49\n\nApple Inc.\n\nCONSOLIDATED STATEMENTS OF OPERATIONS (In millions, except number of shares, which are reﬂected in thousands, and per-share amounts)\n\nYears ended\n\nSeptember 30, 2023\n\nSeptember 24, 2022\n\nSeptember 25, 2021\n\nNet sales: Products Services\n\nTotal net sales\n\n$\n\n298,085 $ 85,200 383,285\n\n316,199 $ 78,129 394,328\n\n297,392 68,425 365,817\n\nCost of sales: Products Services\n\nTotal cost of sales Gross margin\n\n189,282 24,855 214,137 169,148\n\n201,471 22,075 223,546 170,782\n\n192,266 20,715 212,981 152,836\n\nOperating expenses:\n\nResearch and development Selling, general and administrative\n\nTotal operating expenses\n\

# Setup complete!

To keep the demo fast, the resources folder includes a pre-generated set of evaluation test data, created with the `TestsetGenerator` class from the ragas library. The code used to generate this data is provided below as well.

In [23]:
import pandas as pd
testset = pd.read_csv("resources/full_testset.csv")

## TestsetGenerator example code for generating a test set

Generating a test set can be time-consuming, so we pre-generated ours with the following code. See more on creating test sets [here](https://docs.ragas.io/en/latest/getstarted/testset_generation.html).

Note: while we are using a synthetic test set here, RAGAS can also be used with human-labeled data and [self-created test sets](https://docs.ragas.io/en/stable/howtos/applications/data_preparation.html).

In [24]:
# if problems with nltk data
# import os
# os.environ["NLTK_DATA"] = '/Users/<user>/nltk_data'

if not len(testset):
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from llama_index.llms.ollama import Ollama
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.core import SimpleDirectoryReader

    generator_llm = llm
    critic_llm = llm
    embeddings = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    generator = TestsetGenerator.from_llama_index(
        generator_llm=generator_llm,
        critic_llm=critic_llm,
        embeddings=embeddings,
    )

    reader = SimpleDirectoryReader(input_files=[SOURCE_DOC])

    documents = reader.load_data()

    testset = generator.generate_with_llamaindex_docs(
        documents,
        test_size=20,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )

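    # convert to a DataFrame so later cells can iterate over rows, then cache it to disk
    testset = testset.to_pandas()
    testset.to_csv("full_testset.csv")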

# Begin evaluation
[The ragas library](https://docs.ragas.io/en/stable/index.html) provides helpful classes that abstract away the complexity of creating test sets and evaluating apps that use generative technology. Above, we demonstrated how the `TestsetGenerator` class can be used to create an example dataset. Now we will create a few helper functions to collect the answers generated and the contexts retrieved by the RAG QA app we defined earlier. This is the data we will pass to the ragas library to calculate our performance metrics.


In [109]:
from datasets import Dataset
from ragas import evaluate
from ragas.run_config import RunConfig

def parse_contexts(source_docs):
    return [doc.page_content for doc in source_docs]

def create_evaluation_dataset(chain, testset):
    res_set = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }

    for _, row in testset.iterrows():
        # call QA chain
        result = chain.invoke(row["question"])

        res_set["question"].append(row["question"])
        res_set["answer"].append(result["result"])

        contexts = parse_contexts(result["source_documents"])
        
        if not len(contexts):
            print(f"no contexts found for question: {row['question']}")
        res_set["contexts"].append(contexts)
        res_set["ground_truth"].append(str(row["ground_truth"]))

    return Dataset.from_dict(res_set)

def evaluate_dataset(eval_dataset, metrics, llm, embeddings):

    run_config = RunConfig()
    run_config.max_retries = 1


    eval_result = evaluate(
        eval_dataset,
        metrics=metrics,
        run_config=run_config,
        llm=llm,
        embeddings=embeddings
    )

    eval_df = eval_result.to_pandas()
    return eval_df

## Create the Dataset

- Input: the chain to be evaluated and the test set
- Output: a dataset to pass to the ragas evaluation function

In [110]:
eval_dataset = create_evaluation_dataset(qa, testset)



> Entering new RetrievalQA chain...

> Finished chain.



In [113]:
eval_dataset.to_pandas().head()

Unnamed: 0,question,answer,contexts,ground_truth
0,What services does Apple offer through its Pay...,Apple offers two services through its Payment ...,"[The Company operates various platforms, inclu...",Apple offers payment services through Apple Ca...
1,What is the estimated maximum one-day loss in ...,The estimated maximum one-day loss in fair val...,[The Company applied a value-at-risk (“VAR”) m...,$1.0 billion
2,What drives Apple Inc.'s competitive edge & ho...,The competitive edge of Apple Inc. is driven b...,[The Company has a minority market share in th...,The information provided in the context sugges...
3,What are potential risks for Apple if it doesn...,Answer: The potential risks for Apple if it do...,[Apple Inc. | 2023 Form 10-K | 7\n\nThe Compan...,If Apple fails to meet regulatory expectations...
4,What factors contributed to the 7% boost in iP...,Answer: The information provided does not spec...,[(1) Products net sales include amortization o...,The 7% boost in iPhone sales was primarily due...


## Evaluate generation metrics

In [114]:
from ragas.metrics import faithfulness, answer_relevancy

# first generate for faithfulness
faithfulness_metrics = evaluate_dataset(eval_dataset, [faithfulness], llm, embeddings)

Evaluating:   0%|          | 0/19 [00:00<?, ?it/s]

In [115]:
# next for answer_relevancy
answer_relevancy_metrics = evaluate_dataset(eval_dataset, [answer_relevancy], llm, embeddings)

Evaluating:   0%|          | 0/19 [00:00<?, ?it/s]

In [116]:
gen_metrics_default = faithfulness_metrics
gen_metrics_default["answer_relevancy"] = answer_relevancy_metrics["answer_relevancy"]

gen_metrics_default.describe()

| | faithfulness | answer_relevancy |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.554265 | 0.714271 |
| std | 0.425378 | 0.383305 |
| min | 0.0 | 0.0 |
| 25% | 0.118056 | 0.788299 |
| 50% | 0.642857 | 0.887609 |
| 75% | 1.0 | 0.930006 |
| max | 1.0 | 1.0 |


# What do these numbers mean and how were they calculated?

Note: the following examples are paraphrased from the [ragas docs](https://docs.ragas.io/en/stable/concepts/metrics/index.html)

------

### Faithfulness

An answer to a question can be said to be "faithful" if the **claims** that are made in the answer **can be inferred** from the **context**.

#### Mathematically:

$$
Faithfulness\ score = \frac{Number\ of\ claims\ in\ the\ generated\ answer\ that\ can\ be\ inferred\ from\ the\ given\ context}{Total\ number\ of\ claims\ in\ the\ generated\ answer}
$$

#### Example process:

> Question: Where and when was Einstein born?
> 
> Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
>
> answer: Einstein was born in Germany on 20th March 1879.

Step 1: Use LLM to break generated answer into individual statements.
- “Einstein was born in Germany.”
- “Einstein was born on 20th March 1879.”

Step 2: For each statement use LLM to verify if it can be inferred from the context.
- “Einstein was born in Germany.” => yes. 
- “Einstein was born on 20th March 1879.” => no.

Step 3: plug into formula

- Number of claims inferred from context = 1
- Total number of claims = 2
- Faithfulness = 1/2 = 0.5
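
The arithmetic in Step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the ragas implementation: `faithfulness_score` is a hypothetical helper, the per-claim verdicts are assumed to have already been produced by the LLM in Step 2, and returning 0.0 for an answer with no claims is an assumption of this sketch.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims in the answer that the context supports.

    `verdicts` holds one boolean per claim (True = can be inferred
    from the context). Hypothetical helper for illustration only.
    """
    if not verdicts:
        # Assumption of this sketch: an answer with no claims scores 0.0
        return 0.0
    return sum(verdicts) / len(verdicts)


# Verdicts for the Einstein example: born in Germany (supported),
# born on 20th March 1879 (not supported)
einstein_verdicts = [True, False]
print(faithfulness_score(einstein_verdicts))
```

For the Einstein example, one of the two claims is supported, so the score is one half.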

### Answer Relevance

An answer can be said to be relevant if it directly addresses the question (intuitively).

#### Example process:

1. Use an LLM to generate "hypothetical" questions to a given answer with the following prompt:

    > Generate a question for the given answer.
    > answer: [answer]

2. Embed the generated "hypothetical" questions as vectors.
3. Calculate the cosine similarity of the hypothetical questions and the original question, sum those similarities, and divide by n.

With data:

> Question: Where is France and what is its capital?
> 
> answer: France is in western Europe.

Step 1 - use LLM to create 'n' variants of question from the generated answer.

- “In which part of Europe is France located?”
- “What is the geographical location of France within Europe?”
- “Can you identify the region of Europe where France is situated?”

Step 2 - Calculate the mean cosine similarity between the generated questions and the actual question.
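
The averaging in Step 2 can be sketched as follows. This is an illustration under stated assumptions rather than the ragas code: `cosine_similarity` and `answer_relevance` are hypothetical helpers, and the two-dimensional vectors are made-up stand-ins for real embeddings of the original question and the generated questions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and each
    question generated from the answer (hypothetical helper)."""
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# Toy vectors (assumed, not real embeddings): the first generated question
# points the same way as the original, the second is orthogonal to it
question_vec = [1.0, 0.0]
generated_vecs = [[2.0, 0.0], [0.0, 1.0]]
print(answer_relevance(question_vec, generated_vecs))
```

An answer that only covers part of the question, as in the France example, tends to produce generated questions that drift away from the original, which lowers the mean similarity.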



# Evaluate retrieval metrics

In [120]:
from ragas.metrics import context_recall, context_precision

context_recall_metrics = evaluate_dataset(eval_dataset, [context_recall], llm, embeddings)

Evaluating:   0%|          | 0/19 [00:00<?, ?it/s]

In [118]:
context_precision_metrics = evaluate_dataset(eval_dataset, [context_precision], llm, embeddings)

Evaluating:   0%|          | 0/19 [00:00<?, ?it/s]

In [121]:
ret_metrics_default = context_recall_metrics
ret_metrics_default["context_precision"] = context_precision_metrics["context_precision"]

ret_metrics_default.describe()

| | context_recall | context_precision |
|---|---|---|
| count | 19.0 | 19.0 |
| mean | 0.72807 | 0.815789 |
| std | 0.427491 | 0.311283 |
| min | 0.0 | 0.0 |
| 25% | 0.416667 | 0.708333 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


# What do these numbers mean?

Retrieval metrics quantify how well the system fetched the best possible context for generation. As before, review the definitions below to understand what happens under the hood when we execute the evaluation code.

-----

### Context Relevance

"The context is considered relevant to the extent that it exclusively contains information that is needed to answer the question."

#### Example process:

1. Use the following LLM prompt to extract a subset of sentences necessary to answer the question. The context is defined as the formatted search result from the vector database.

    > Please extract relevant sentences from
    > the provided context that can potentially
    > help answer the following `{question}`. If no
    > relevant sentences are found, or if you
    > believe the question cannot be answered
    > from the given context, return the phrase
    > "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences
    > from given `{context}`.

2. Compute the context relevance score = (number of extracted sentences) / (total number of sentences in context)
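
The ratio in step 2 can be sketched as below. This is only an illustration: `context_relevance` is a hypothetical helper, the context is assumed to be already split into sentences, the extracted list stands in for what the LLM prompt would return, and scoring an empty context as 0.0 is an assumption of this sketch.

```python
def context_relevance(extracted_sentences: list[str], context_sentences: list[str]) -> float:
    """Share of context sentences that the LLM extracted as needed
    to answer the question (hypothetical helper)."""
    if not context_sentences:
        # Assumption of this sketch: an empty context scores 0.0
        return 0.0
    return len(extracted_sentences) / len(context_sentences)


# Illustrative data: a three-sentence context about France, of which
# only the first sentence helps answer "Where is France?"
context_sentences = [
    "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches.",
    "The country is also renowned for its wines and sophisticated cuisine.",
    "Lascaux's ancient cave drawings, Lyon's Roman theater and the vast Palace of Versailles attest to its rich history.",
]
extracted_sentences = [context_sentences[0]]
print(context_relevance(extracted_sentences, context_sentences))
```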

Moving from the original paper to the actively maintained ragas library, there are a few more insightful metrics to evaluate. From the library [source](https://docs.ragas.io/en/stable/concepts/metrics/index.html), let's introduce `context precision` and `context recall`.

### Context recall
Context can be said to have high recall if retrieved context aligns with the ground truth answer.

#### Mathematically:

$$
Context\ recall = \frac{Ground\ Truth\ sentences\ that\ can\ be\ attributed\ to\ context}{Total\ number\ of\ sentences\ in\ the\ ground\ truth}
$$

#### Example process:

Data:
> Question: Where is France and what is its capital?
>
> Ground truth answer: France is in Western Europe and its capital is Paris.
>
> Context: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
>
> Note: the ground truth answer can be created by a critic LLM or taken from your own human-labeled data set.

Step 1 - use an LLM to break the ground truth down into individual statements:
- `France is in Western Europe`
- `Its capital is Paris`

Step 2 - for each ground truth statement, use an LLM to determine if it can be attributed to the context.
- `France is in Western Europe` => yes
- `Its capital is Paris` => no


Step 3 - plug in to formula

context recall = (1 + 0) / 2 = 0.5
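
The same calculation can be sketched in Python. This is a minimal illustration, not the ragas implementation: `context_recall_score` is a hypothetical helper, the attribution verdicts are assumed to come from the LLM in Step 2, and scoring an empty ground truth as 0.0 is an assumption of this sketch.

```python
def context_recall_score(attributions: dict[str, bool]) -> float:
    """Share of ground truth statements that can be attributed to the
    retrieved context (hypothetical helper).

    `attributions` maps each ground truth statement to the LLM verdict.
    """
    if not attributions:
        # Assumption of this sketch: an empty ground truth scores 0.0
        return 0.0
    return sum(attributions.values()) / len(attributions)


# Verdicts from the France example above
france_attributions = {
    "France is in Western Europe": True,
    "Its capital is Paris": False,
}
print(context_recall_score(france_attributions))
```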

### Context precision

This metric relates to how chunks are ranked in a response. Ideally, the most relevant chunks are at the top.

#### Mathematically:

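$$
Context\ Precision@K = \frac{\sum_{k=1}^{K} \left( Precision@k \times v_k \right)}{Total\ number\ of\ relevant\ items\ in\ the\ top\ K\ results}
$$

$$
Precision@k = \frac{true\ positives@k}{true\ positives@k + false\ positives@k}
$$

Here $K$ is the number of retrieved chunks and $v_k \in \{0, 1\}$ indicates whether the chunk at rank $k$ is relevant.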

#### Example process:

Data:
> Question: Where is France and what is its capital?
> 
> Ground truth: France is in Western Europe and its capital is Paris.
> 
> Context: [ “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”]

Step 1 - for each chunk, use the LLM to check whether it is relevant to the ground truth answer. This gives a relevance indicator v_k for the chunk at rank k (1 if relevant, 0 otherwise).

Step 2 - for each rank k, calculate precision@k, the fraction of the top k chunks that are relevant:
- `“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”` (rank 1, not relevant) => precision@1 = 0/1 = 0.
- `“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”` (rank 2, relevant) => precision@2 = (1 true positive) / (1 true positive + 1 false positive) = 0.5.


Step 3 - calculate the overall context precision by weighting each precision@k by v_k and dividing by the number of relevant chunks = (0 × 0 + 0.5 × 1) / 1 = 0.5
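
The ranking-aware calculation can be sketched as follows. This is an illustration, not the ragas code: `context_precision_score` is a hypothetical helper that takes the per-chunk relevance verdicts in ranked order (assumed to come from the LLM in Step 1), weights precision@k by whether the chunk at rank k is relevant, and divides by the number of relevant chunks. Returning 0.0 when no chunk is relevant is an assumption of this sketch.

```python
def context_precision_score(relevance: list[int]) -> float:
    """Relevance-weighted mean of precision@k over ranked chunks.

    `relevance[i]` is 1 if the chunk at rank i + 1 is relevant, else 0.
    Hypothetical helper for illustration only.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        # Assumption of this sketch: no relevant chunks scores 0.0
        return 0.0

    weighted_sum = 0.0
    hits = 0  # relevant chunks seen so far (true positives@k)
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precision_at_k = hits / rank
        # only ranks holding a relevant chunk contribute to the sum
        weighted_sum += precision_at_k * is_relevant
    return weighted_sum / total_relevant


# France example: the irrelevant chunk is ranked first, the relevant one second
print(context_precision_score([0, 1]))
# Swapping the order puts the relevant chunk on top
print(context_precision_score([1, 0]))
```

The second call shows why ordering matters: the same two chunks score higher when the relevant one is ranked first.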

# Implement alternative chain for comparison: Parent Document Retriever

Now that we've established baseline metrics let's implement a chain using the parent document retriever approach to see how the retrieval strategies compare.

The parent document retriever attempts to optimize two competing objectives within RAG:

1. Smaller chunks can lead to better embeddings, since there is less surrounding text to dilute the point of each chunk.
2. Larger chunks help retain overall context that may be valuable once the chunk is retrieved.

In theory, this approach allows for the initial query search on smaller chunks for specificity but returns the larger chunks for more complete context.
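
The idea can be sketched without any libraries. This toy example is not how LangChain implements it: `split`, `build_index`, `overlap`, and `retrieve_parent` are hypothetical helpers, fixed-width character splitting stands in for a real text splitter, and word overlap stands in for vector similarity.

```python
def split(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(document: str, parent_size: int, child_size: int):
    """Split into parent chunks, then split each parent into children
    that remember which parent they came from."""
    parents = {}
    children = []
    for parent_id, parent in enumerate(split(document, parent_size)):
        parents[parent_id] = parent
        for child in split(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def overlap(query: str, text: str) -> int:
    """Word overlap, used here as a stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, parents: dict, children: list) -> str:
    """Search over the small child chunks, return the larger parent chunk."""
    _, parent_id = max(children, key=lambda c: overlap(query, c[0]))
    return parents[parent_id]


# Toy document of eight 4-letter "words"
document = "aaaa bbbb cccc dddd eeee ffff gggg hhhh "
parents, children = build_index(document, parent_size=20, child_size=10)
print(retrieve_parent("gggg", parents, children))
```

The query matches only the small child chunk containing `gggg`, but the caller receives the whole parent chunk, including the neighbouring `eeee ffff` text.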

Let's implement it and see whether it improves our results.

In [76]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader, UnstructuredFileLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.vectorstores.redis import Redis as LangChainRedis

# We will make a new index for this example defined directly

In [122]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

PARENT_CHUNK_SIZE = 2500
CHILD_CHUNK_SIZE = 500

# This text splitter is used to create the parent documents aka larger chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=PARENT_CHUNK_SIZE)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=CHILD_CHUNK_SIZE)


In [123]:
# embeddings for redis vector store
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Note: it is **critical** that our index includes the `doc_id` field; otherwise, the parent document linking will not happen correctly.

In [124]:
# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}, {"name": "doc_id"}],
    "content_vector_key": "chunk_vector" ,   # name of the vector field in langchain
}

vector_store = LangChainRedis(
    REDIS_URL,
    "child_docs",
    embeddings,
    index_schema=index_schema
)

In [125]:
from typing import Any
from langchain.storage.encoder_backed import EncoderBackedStore
from langchain.storage import RedisStore
import pickle

def key_encoder(key: int | str) -> str:
    # Redis keys must be strings
    return str(key)

def value_serializer(value: Any) -> bytes:
    # pickle the parent document so it can be stored as bytes in Redis
    return pickle.dumps(value)

def value_deserializer(serialized_value: bytes) -> Any:
    # restore the pickled parent document
    return pickle.loads(serialized_value)

# Create the underlying Redis-backed byte store, reusing the connection URL defined earlier
base_store = RedisStore(redis_url=REDIS_URL, namespace="parent_docs")

# Create an instance of the encoder-backed store
encoder_store = EncoderBackedStore(
    store=base_store,
    key_encoder=key_encoder,
    value_serializer=value_serializer,
    value_deserializer=value_deserializer
)

In [126]:
from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=encoder_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [127]:
# Note: we are adding the source documents and the ParentDocumentRetriever will automatically split them into parent and child documents
parent_doc_retriever.add_documents(source_doc)

In [128]:
# test that the retriever works
retrieved_docs = parent_doc_retriever.invoke("apples's revenue 2023")
retrieved_docs[0]

Document(page_content='2023\n\n2022\n\nGross margin:\n\nProducts Services\n\nTotal gross margin\n\n$\n\n$\n\n108,803 $ 60,345 169,148 $\n\n114,728 $ 56,054 170,782 $\n\nGross margin percentage:\n\nProducts Services\n\nTotal gross margin percentage\n\n36.5 % 70.8 % 44.1 %\n\n36.3 % 71.7 % 43.3 %\n\nProducts Gross Margin\n\nProducts gross margin decreased during 2023 compared to 2022 due to the weakness in foreign currencies relative to the U.S. dollar and lower Products volume, partially oﬀset by cost savings and a diﬀerent Products mix.\n\nProducts gross margin percentage increased during 2023 compared to 2022 due to cost savings and a diﬀerent Products mix, partially oﬀset by the weakness in foreign currencies relative to the U.S. dollar and decreased leverage.\n\nServices Gross Margin\n\nServices gross margin increased during 2023 compared to 2022 due primarily to higher Services net sales, partially oﬀset by the weakness in foreign currencies relative to the U.S. dollar and higher S

In [129]:
# keep the same but use our new retriever
parent_doc_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=parent_doc_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

## Like before let's first create the dataset

In [130]:
eval_dataset = create_evaluation_dataset(parent_doc_qa, testset)



> Entering new RetrievalQA chain...

> Finished chain.



In [131]:
eval_dataset.to_pandas().head()

Unnamed: 0,question,answer,contexts,ground_truth
0,What services does Apple offer through its Pay...,Answer: Apple offers various services through ...,[Rest of Asia Paciﬁc\n\nRest of Asia Paciﬁc ne...,Apple offers payment services through Apple Ca...
1,What is the estimated maximum one-day loss in ...,Answer: The estimated maximum one-day loss in ...,[The Company applied a value-at-risk (“VAR”) m...,$1.0 billion
2,What drives Apple Inc.'s competitive edge & ho...,Apple Inc.'s competitive edge is driven by its...,[The Company’s ability to compete successfully...,The information provided in the context sugges...
3,What are potential risks for Apple if it doesn...,Answer: The provided context does not directly...,[Critical Accounting Estimates\n\nThe preparat...,If Apple fails to meet regulatory expectations...
4,What factors contributed to the 7% boost in iP...,Answer: The 7% boost in iPhone sales was prima...,[Rest of Asia Paciﬁc\n\nRest of Asia Paciﬁc ne...,The 7% boost in iPhone sales was primarily due...


In [132]:
from ragas.metrics import faithfulness, answer_relevancy

parent_doc_faithfulness_metrics = evaluate_dataset(eval_dataset, [faithfulness], llm, embeddings)


In [133]:
parent_doc_answer_relevancy_metrics = evaluate_dataset(eval_dataset, [answer_relevancy], llm, embeddings)


In [134]:
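# combine the two generation metrics for the parent document chain into one frame
parent_doc_gen_metrics = parent_doc_faithfulness_metrics
parent_doc_gen_metrics["answer_relevancy"] = parent_doc_answer_relevancy_metrics["answer_relevancy"]

# prefix the columns so they do not clash with the baseline metric names
parent_doc_gen_metrics.rename(columns={"faithfulness": "parent_doc_faithfulness", "answer_relevancy": "parent_doc_answer_relevancy"}, inplace=True)

# place baseline and parent document results side by side for comparison
overall_gen_metrics = pd.concat([gen_metrics_default, parent_doc_gen_metrics], axis=1)
overall_gen_metrics.describe()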

Unnamed: 0,faithfulness,answer_relevancy,parent_doc_faithfulness,parent_doc_answer_relevancy
count,19.0,19.0,19.0,19.0
mean,0.554265,0.714271,0.542544,0.61158
std,0.425378,0.383305,0.446064,0.438532
min,0.0,0.0,0.0,0.0
25%,0.118056,0.788299,0.0,0.0
50%,0.642857,0.887609,0.6,0.840875
75%,1.0,0.930006,1.0,0.960617
max,1.0,1.0,1.0,1.0


## And the same for the retrieval metrics

In [135]:
from ragas.metrics import context_recall, context_precision

parent_doc_context_recall_metrics = evaluate_dataset(eval_dataset, [context_recall], llm, embeddings)


In [136]:
parent_doc_context_precision_metrics = evaluate_dataset(eval_dataset, [context_precision], llm, embeddings)


In [137]:
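# combine the two retrieval metrics for the parent document chain into one frame
parent_doc_ret_metrics = parent_doc_context_recall_metrics
parent_doc_ret_metrics["context_precision"] = parent_doc_context_precision_metrics["context_precision"]

# prefix the columns so they do not clash with the baseline metric names
parent_doc_ret_metrics.rename(columns={"context_precision": "parent_doc_context_precision", "context_recall": "parent_doc_context_recall"}, inplace=True)

# place baseline and parent document results side by side for comparison
overall_ret_metrics = pd.concat([ret_metrics_default, parent_doc_ret_metrics], axis=1)

overall_ret_metrics.describe()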

Unnamed: 0,context_recall,context_precision,parent_doc_context_recall,parent_doc_context_precision
count,19.0,19.0,19.0,19.0
mean,0.72807,0.815789,0.72807,0.849415
std,0.427491,0.311283,0.427491,0.329494
min,0.0,0.0,0.0,0.0
25%,0.416667,0.708333,0.416667,0.958333
50%,1.0,1.0,1.0,1.0
75%,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0


# Analysis

In this case, the extra context supplied by the parent document retriever had a slightly negative effect on the generation metrics: mean faithfulness moved from 0.554 to 0.543 and mean answer relevancy from 0.714 to 0.612, possibly because the larger context made the answers less focused. Mean context recall was unchanged at 0.728, while mean context precision rose slightly from 0.816 to 0.849. This suggests that matching the query against smaller chunks helped rank the relevant context higher, although ranking does not appear to have been a limiting factor in the base case for this test. With only 19 samples and standard deviations of roughly 0.3 to 0.45, these differences are small, so more extensive testing would be needed to draw authoritative conclusions. The example nonetheless shows how to compare options in order to find the highest-priority strategies for a given application.
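
As a quick check on the comparison, the sketch below recomputes the change in each mean score between the two chains. The numbers are copied by hand from the two `describe()` tables above; the dictionary names and the `mean_deltas` helper are illustrative and not part of the notebook's code.

```python
# Mean scores copied from the two describe() tables above.
baseline = {
    "faithfulness": 0.554265,
    "answer_relevancy": 0.714271,
    "context_recall": 0.72807,
    "context_precision": 0.815789,
}
parent_doc = {
    "faithfulness": 0.542544,
    "answer_relevancy": 0.61158,
    "context_recall": 0.72807,
    "context_precision": 0.849415,
}


def mean_deltas(base, candidate):
    """Return candidate minus base for every metric, rounded to 3 decimals."""
    return {name: round(candidate[name] - base[name], 3) for name in base}


deltas = mean_deltas(baseline, parent_doc)
for name, delta in deltas.items():
    # A negative value means the parent document chain scored lower on average.
    print(f"{name:>18}: {delta:+.3f}")
```

A table of differences like this makes it easier to see at a glance which metrics moved, though it says nothing about whether the movement is statistically meaningful.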

# Review


In this notebook we covered:
- why it's important to have an evaluation framework
- the basic theory of RAGAS
- how to calculate faithfulness, answer_relevancy, context_precision, and context_recall
- code to evaluate two different RAG chains to monitor how using a different retrieval strategy affects performance


# Next steps: end-to-end evaluation

As your pipeline matures and human-labeled ground truth data becomes available, the following metrics can be added for increased rigor. They can be implemented in the same way as the ones showcased above.


## Answer correctness

A weighted average of answer semantic similarity and answer factual similarity (both described in the sections that follow), where the weights can be passed as a parameter.
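
A minimal sketch of the weighted average, assuming both component scores are already available as numbers between 0 and 1. The function name and the default weights are illustrative choices, not values taken from this notebook.

```python
def answer_correctness(factual, semantic, weights=(0.75, 0.25)):
    """Weighted average of a factual score and a semantic score.

    The default weights are an illustrative assumption; pass your own to
    change how much each component contributes.
    """
    w_factual, w_semantic = weights
    total = w_factual + w_semantic
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total keeps the result in [0, 1] even if the weights
    # do not sum to exactly 1.
    return (w_factual * factual + w_semantic * semantic) / total


score = answer_correctness(factual=0.5, semantic=0.9)
print(round(score, 3))  # 0.6
```

Raising the factual weight penalizes answers that sound similar to the ground truth but get facts wrong; raising the semantic weight is more forgiving of paraphrases.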

## Answer semantic similarity

Measures how close the generated answer is to the ground truth in embedding space.

#### Example process:
- vectorize the ground truth answer and the generated answer
- compute the cosine similarity between the two vectors
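
The two steps above can be sketched as follows. To keep the example self-contained, `toy_vectorize` is a hypothetical stand-in that uses word counts; in practice the vectors would come from an embedding model such as the `embeddings` object used earlier in this notebook.

```python
import math
from collections import Counter


def toy_vectorize(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().replace(".", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


ground_truth = "Einstein was born in 1879 in Germany."
generated = "Einstein was born in Spain in 1879."

score = cosine_similarity(toy_vectorize(ground_truth), toy_vectorize(generated))
print(round(score, 3))  # 0.889
```

Note that the two sentences score high here even though they disagree on the country of birth. A similarity score on its own can miss factual errors, which is one reason to pair it with a factual measure.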

## Answer factual similarity

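Measures the overlap between the factual statements in the ground truth and those in the generated answer, expressed as an F1 score. Mathematically: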

$$
\text{F1 Score} = \frac{TP}{TP + 0.5 \times (FP + FN)}
$$

Where:
- TP (True Positive): facts or statements that are present in both the ground truth and the generated answer.
- FP (False Positive): facts or statements that are present in the generated answer but not in the ground truth.
- FN (False Negative): facts or statements that are present in the ground truth but not in the generated answer.

#### Example process:

Data:
> Ground truth: Einstein was born in 1879 in Germany.
>
> Generated answer: Einstein was born in Spain in 1879.

TP: [Einstein was born in 1879]

FP: [Einstein was born in Spain]

FN: [Einstein was born in Germany]

F1 = 1 / (1 + 0.5 × (1 + 1)) = 1/2
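
The same arithmetic as a minimal sketch, assuming the statements have already been sorted into TP, FP, and FN lists. The `factual_f1` helper is illustrative; how the statements get extracted and classified is outside its scope.

```python
def factual_f1(tp, fp, fn):
    """F1 over classified statements: TP / (TP + 0.5 * (FP + FN))."""
    denominator = len(tp) + 0.5 * (len(fp) + len(fn))
    # No statements at all: define the score as 0.0 to avoid dividing by zero.
    return len(tp) / denominator if denominator else 0.0


tp = ["Einstein was born in 1879"]
fp = ["Einstein was born in Spain"]
fn = ["Einstein was born in Germany"]

score = factual_f1(tp, fp, fn)
print(score)  # 0.5
```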

