# Lesson: RAG Retrieval Errors

### RAGAS


The ragas library focuses on metrics that apply directly to RAG pipelines. Its core metrics are:

- **Context Precision**: Measures how much of the context retrieved by the RAG model is actually relevant to the query, that is, how few irrelevant documents or context segments are pulled in.

- **Faithfulness**: Assesses how faithfully the generated response represents the information in the retrieved documents. This is crucial in ensuring that the RAG model's output is not only relevant but also accurately reflects the source material.

- **Answer Relevancy**: Evaluates the relevancy of the generated answer to the query. This is essential for tasks like question answering, where the goal is to provide accurate and relevant answers based on the retrieved context.

### Other Metrics

1. Context Recall:
While precision focuses on the relevance of retrieved documents, recall assesses the model's ability to retrieve all relevant documents from the dataset. This is important in contexts where missing key information can lead to incomplete or inaccurate responses.

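2. ROUGE:
Recall-oriented n-gram overlap between the generated text and a reference text.

3. BLEU:
Precision-oriented n-gram overlap between the generated text and a reference text, with a penalty for overly short outputs.

4. Perplexity:
The exponentiated average negative log-likelihood a language model assigns to a text; lower values mean the model finds the text less surprising.

5. Logprobs from OpenAI:
Per-token log probabilities returned by the OpenAI API, which can serve as a rough confidence signal for a generated answer.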

6. Retrieval Diversity:
Evaluates the variety in the retrieved documents. High diversity ensures that the model is not just retrieving similar documents but is considering a wide range of potentially relevant information.

7. Query-Document Alignment:
This involves assessing how well the model's query representation aligns with the document representations in its database. Misalignment can lead to retrieval errors, where the model retrieves documents that are semantically distant from the query.

8. Ranking Accuracy:
Evaluates how accurately the model ranks the retrieved documents in order of relevance. Higher-ranking accuracy ensures that the most relevant documents are considered first for generating responses.
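
The retrieval-side metrics above can be sketched in a few lines. This is a minimal illustration, not how ragas computes its scores: it assumes you already know which document IDs are relevant for a query, and the function names, document IDs, and 2-d embeddings are all made up for the example. Reciprocal rank stands in for ranking accuracy, and mean pairwise cosine distance stands in for retrieval diversity.

```python
import math
from itertools import combinations


def precision_recall(retrieved, relevant):
    """Set-based precision and recall over document IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diversity(embeddings):
    """Mean pairwise cosine distance (1 - similarity) between retrieved documents."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy data: hypothetical document IDs and embeddings, for illustration only.
retrieved = ["d3", "d1", "d7", "d2"]  # in ranked order
relevant = ["d1", "d2", "d5"]

precision, recall = precision_recall(retrieved, relevant)
rr = reciprocal_rank(retrieved, relevant)
spread = diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(precision, recall, rr, spread)
```

The same `cosine_similarity` helper gives a rough view of query-document alignment when applied to a query embedding and a document embedding: a low value for a retrieved document suggests it is semantically distant from the query.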

# Lesson: Synthetic Test Data Generation

- We are given a dataset to build a RAG system on.
- We can either manually write question-answer (QA) pairs from that dataset for evaluation, or
- we can generate the QA data synthetically. Let's see how to do that.

### Load Documents

In [3]:
# Set the OpenAI API key
import os
import openai

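# Read the key from the environment and fail early if it is missing
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
openai.api_key = api_key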

In [4]:
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader("../data/paul_graham", show_progress=True)
documents = loader.load()


### Generate Synthetic Testing Data

In [5]:
from ragas.testset import TestsetGenerator
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM

In [None]:
# Add custom llms and embeddings
generator_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-3.5-turbo"))
critic_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-4"))
embeddings_model = OpenAIEmbeddings()


In [None]:
# Change resulting question type distribution
testset_distribution = {
    "simple": 0.25,
    "reasoning": 0.5,
    "multi_context": 0.0,
    "conditional": 0.25,
}
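
Before passing the distribution to the generator, it can help to sanity-check it. The sketch below assumes the generator expects non-negative fractions that sum to 1; that requirement is an assumption here, not something confirmed by the notebook, and `validate_distribution` is a hypothetical helper rather than part of ragas.

```python
import math


def validate_distribution(distribution, tolerance=1e-9):
    """Raise ValueError unless all weights are non-negative and sum to 1."""
    negative = [name for name, weight in distribution.items() if weight < 0]
    if negative:
        raise ValueError(f"Negative weights for: {negative}")
    total = sum(distribution.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"Weights sum to {total}, expected 1.0")
    return distribution


# Same values as the cell above.
testset_distribution = {
    "simple": 0.25,
    "reasoning": 0.5,
    "multi_context": 0.0,
    "conditional": 0.25,
}
validate_distribution(testset_distribution)
```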

In [None]:
# Fraction of conversational questions
chat_qa = 0.2

test_generator = TestsetGenerator(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings_model=embeddings_model,
    testset_distribution=testset_distribution,
    chat_qa=chat_qa,
)
testset = test_generator.generate(documents, test_size=5)


testset_df = testset.to_pandas()



In [None]:
testset_df.head()


| | question | ground_truth_context | ground_truth | question_type | episode_done |
|---|---|---|---|---|---|
| 0 | What factors influenced the choice to start Y ... | [The prospect of having to stand up in front o... | [The factors that influenced the choice to sta... | conditional | True |
| 1 | What prompted Jessica Livingston to compile a ... | [Jessica was surprised by the disparities betw... | [The disparities between her bank's perception... | conditional | True |
| 2 | What inspired Jessica Livingston to compile a ... | [One of the guests was someone I didn't know b... | [The information from the given context does n... | simple | True |
| 3 | Could online store software be operated on a s... | [What if we ran the software on the server, an... | [Yes, online store software could be operated ... | conditional | True |


In [None]:
testset_df.question[0]

'What factors influenced the choice to start Y Combinator as an angel firm instead of raising a fund and how was the batch model for funding startups developed?'

In [None]:
testset_df.ground_truth_context[1]

["Jessica was surprised by the disparities between her bank's perception of startups and the actuality after meeting friends from the startup world."]

In [None]:
testset_df.ground_truth[0]

['The factors that influenced the choice to start Y Combinator as an angel firm instead of raising a fund were the belief that successful startup founders would be the best sources of seed funding and advice, and the desire to stop procrastinating about angel investing. The batch model for funding startups was developed as a way to fund a bunch of startups at once and gain experience as investors.']

### Compute Responses using RAG

In [None]:
from rag_langchain import RAGLangchain

In [None]:
rag = RAGLangchain(input_dir="./data/paul_graham", persist_dir="./vectordb")


In [None]:
rag.get_response("Did paul graham meet Sam altman?")


{'output_text': "The text doesn't provide information on whether Paul Graham met Sam Altman."}

In [None]:
# Function to get RAG response for each question
def get_rag_response(question):
    try:
        response = rag.get_response(question)
        return response.get('output_text')
    except Exception as e:
        print(f"Error while getting response for question '{question}': {str(e)}")
        return None

In [None]:
%%time

testset_df['llm_response'] = testset_df['question'].apply(get_rag_response)

CPU times: user 108 ms, sys: 63.3 ms, total: 172 ms
Wall time: 17.6 s


In [None]:
testset_df.head()

| | question | ground_truth_context | ground_truth | question_type | episode_done | llm_response | contexts | answer | ground_truths |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What factors influenced the choice to start Y ... | [The prospect of having to stand up in front o... | [The factors that influenced the choice to sta... | conditional | True | The decision to start Y Combinator as an angel... | [[The prospect of having to stand up in front ... | [The decision to start Y Combinator as an ange... | [[The factors that influenced the choice to st... |
| 1 | What inspired the idea of running software on ... | [One morning as I was lying on this mattress I... | [The idea of running software on the server an... | simple | True | The idea of running software on the server and... | [[One morning as I was lying on this mattress ... | [The inspiration for the idea of running softw... | [[The idea of running software on the server a... |
| 2 | What inspired Jessica Livingston to compile a ... | [One of the guests was someone I didn't know b... | [The information from the given context does n... | simple | True | Jessica was inspired to compile a book of inte... | [[One of the guests was someone I didn't know ... | [Jessica was inspired to compile a book of int... | [[The information from the given context does ... |


In [None]:
testset_df.ground_truth_context.dtype, testset_df.ground_truth_context[0]

(dtype('O'),
 ["The prospect of having to stand up in front of a group of people and tell them something that won't waste their time is a great spur to the imagination.\nWhen the Harvard Computer Society, the undergrad computer club, asked me to give a talk, I decided I would tell them how to start a startup.\nSo I gave this talk, in the course of which I told them that the best sources of seed funding were successful startup founders, because then they'd be sources of advice too.\nBut afterward it occurred to me that I should really stop procrastinating about angel investing.\nWe'd start our own investment firm and actually implement the ideas we'd been talking about.\nThere were VC firms, which were organized companies with people whose job it was to make investments, but they only did big, million dollar investments.\nAnd there were angels, who did smaller investments, but these were individuals who were usually focused on other things and made investments on the side.\nOur plan was

In [None]:
import pandas as pd
from datasets import Dataset, Features, Sequence, Value

def convert_to_hf_dataset(testset_df):
    """
    Convert a pandas DataFrame into a Hugging Face Dataset with the required format.

    Parameters:
    testset_df (pd.DataFrame): DataFrame containing the data in the format 
                               ['question', 'ground_truth_context', 'ground_truth', 'question_type', 
                                'episode_done', 'llm_response']

    Returns:
    Dataset: A Hugging Face Dataset ready for evaluation.
    """

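    # Prepare the DataFrame for conversion (adds three columns to testset_df in place)
    # Note: 'contexts' is filled from the ground-truth context, not from what the
    # retriever returned, so context metrics here do not measure retrieval quality.
    testset_df['contexts'] = testset_df['ground_truth_context'].apply(lambda x: [x] if isinstance(x, str) else x)
    # A failed RAG call (None) becomes the string "None" here
    testset_df['answer'] = testset_df['llm_response'].apply(lambda x: str(x))
    testset_df['ground_truths'] = testset_df['ground_truth'].apply(lambda x: [x] if isinstance(x, str) else x)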

    # Define the dataset features using Features
    features = Features({
        'question': Value('string'),
        'contexts': Sequence(Value('string')),
        'answer': Value('string'),
        'ground_truths': Sequence(Value('string')),
    })

    # Convert to Hugging Face Dataset
    hf_dataset = Dataset.from_pandas(testset_df[['question', 'contexts', 'answer', 'ground_truths']], features=features)

    return hf_dataset
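
Since the conversion relies on each row having a specific shape, a quick per-row check can catch problems before evaluation. The sketch below is a hypothetical helper (`validate_record` is not part of ragas or `datasets`) that works on plain dictionaries with the four column names used above; the sample records are invented for illustration.

```python
def validate_record(record):
    """Return a list of problems found in one evaluation record (empty if it looks fine)."""
    problems = []
    for key in ("question", "answer"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: expected a non-empty string")
    # get_rag_response returns None on failure, and str(None) becomes "None".
    if record.get("answer") == "None":
        problems.append("answer: looks like a failed RAG call")
    for key in ("contexts", "ground_truths"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key}: expected a list of strings")
    return problems


# Invented sample records, for illustration only.
good = {
    "question": "Who wrote the essay?",
    "contexts": ["Some retrieved passage."],
    "answer": "The author did.",
    "ground_truths": ["The author."],
}
bad = {
    "question": "Who wrote the essay?",
    "contexts": "not a list",
    "answer": "None",
    "ground_truths": ["The author."],
}
print(validate_record(good))
print(validate_record(bad))
```

With a DataFrame, the same check could be applied to each row via `testset_df.to_dict("records")`.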

In [None]:
final_df = convert_to_hf_dataset(testset_df)

In [None]:
# A Hugging Face Dataset has no .head(); convert back to pandas to preview rows
final_df.to_pandas().head()

### Evaluate

In [None]:
from ragas.evaluation import evaluate


evaluate(final_df)

evaluating with [answer_relevancy]
evaluating with [context_precision]
evaluating with [faithfulness]
evaluating with [context_recall]

{'answer_relevancy': 0.9772, 'context_precision': 1.0000, 'faithfulness': 0.1667, 'context_recall': 0.6667}
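
To read a result like this at a glance, one option is to list the metrics that fall below some cutoff, weakest first. The sketch below copies the scores above into a plain dictionary; the `weakest_metrics` helper and the 0.7 threshold are arbitrary choices for illustration, not ragas conventions.

```python
# Scores copied from the evaluation output above.
scores = {
    "answer_relevancy": 0.9772,
    "context_precision": 1.0000,
    "faithfulness": 0.1667,
    "context_recall": 0.6667,
}


def weakest_metrics(scores, threshold=0.7):
    """Return (metric, score) pairs below the threshold, lowest score first."""
    low = [(name, value) for name, value in scores.items() if value < threshold]
    return sorted(low, key=lambda item: item[1])


for name, value in weakest_metrics(scores):
    print(f"{name}: {value:.4f}")
```

Here the low faithfulness score is the first thing to investigate, since it suggests the generated answers contain statements that the supplied context does not support.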